1070 Commits

Author SHA1 Message Date
Chris Lu
c934b5dab6 fix(credential/postgres,s3api/iam): rename safety + pgxutil follow-ups to #9226 (#9230)
* refactor(util): extract pgx OpenDB + DSN builder into shared pgxutil

The postgres filer store had OpenPGXDB plus duplicated key=value DSN
assembly across postgres/ and postgres2/. Move the connection helper to
weed/util/pgxutil and add BuildDSN so the credential postgres store can
land on the same code path.

filer/postgres/pgx_conn.go keeps OpenPGXDB as a thin alias so postgres2
keeps building unchanged.

* refactor(credential/postgres): use shared pgxutil for connection setup

Replace the bespoke fmt.Sprintf DSN + sql.Open("pgx", ...) path with
pgxutil.BuildDSN + pgxutil.OpenDB so the credential store mirrors the
postgres filer store. This also drops the leaky RegisterConnConfig-style
init in favor of stdlib.OpenDB(*config), which doesn't accumulate
entries in the global pgx config map.

Adds parity knobs the filer store already exposes: sslcrl, and
configurable connection_max_idle / connection_max_open /
connection_max_lifetime_seconds (with the previous hardcoded 25/5/5min
as defaults). Also moves the jsonbParam helper here so other store
files can reuse it. (Helper is also referenced by postgres_identity.go,
which is migrated to it in the next commit.)

* refactor(credential/postgres): use jsonbParam helper across all writers

Consolidate JSONB write handling on the new pgxutil-adjacent helper
jsonbParam(b []byte) interface{}, which returns nil (driver writes SQL
NULL) when the marshaled JSON is empty and string(b) otherwise.

postgres_identity.go: replace the inline 'var fooParam any' /
'fooParam = string(b)' pattern with the helper. Same in CreateUser
and UpdateUser.

postgres_inline_policy.go, postgres_policy.go, postgres_service_account.go,
postgres_group.go: every JSONB writer was still passing []byte. Under
pgx simple_protocol (pgbouncer_compatible=true), []byte is encoded as
bytea and Postgres rejects that against a JSONB column with "invalid
input syntax for type json". Route them through jsonbParam too.

* fix(credential/postgres): rework SaveConfiguration to handle rename + UNIQUE access keys

The IAM rename path (s3api UpdateUser) renames an identity in place
and keeps its access keys. With the previous flow — upsert each user,
then per-user delete-and-insert credentials, then prune absent users —
the renamed user's access keys were still owned by the old row when
the INSERT for the new name ran, tripping credentials.access_key's
global UNIQUE constraint and failing every rename of a user with
credentials.

Reorder the SaveConfiguration body so the prune step runs BEFORE the
credential replace. CASCADE on the old user releases its access keys
in the same transaction, and the new name can then claim them.

While here:
- Replace the per-user loop DELETE FROM users WHERE username = $1 with
  a single DELETE ... WHERE username = ANY($1), one round trip instead
  of N inside the transaction.
- Surface inline-policy CASCADE losses: count user_inline_policies for
  the prune set and emit a Warningf when the count is non-zero so
  rename-driven drops are visible in operator logs (the structural
  fix for renames lives at the IAM layer in a follow-up commit).
- Two-pass credential replace: clear credentials for every user we are
  about to rewrite first, then insert, so an access key can be moved
  between two users in the same SaveConfiguration call.
- credErr := credRows.Err() before credRows.Close() in
  LoadConfiguration — Err() is documented as safe after Close, but
  the leading-capture pattern matches the rest of the file.

* fix(s3api/iam): preserve inline policies when renaming a user

EmbeddedIamApi.UpdateUser renames an identity in place and the caller
persists via SaveConfiguration, which prunes the old username and
CASCADE-drops its rows from user_inline_policies. GetUserPolicy and
ListUserPolicies then return nothing under the new name even though
the API reported success — silent data loss.

Before flipping sourceIdent.Name, list the user's stored inline
policies and re-attach each one under the new name. The subsequent
SaveConfiguration prune still CASCADE-removes the old-name rows; only
the duplicates we just wrote under the new name survive. Adds a
regression test that puts a policy on the old name, renames, and
asserts the policy is readable under the new name.

* perf(credential/postgres): batch the credential clear in SaveConfiguration

The two-pass credential replace was clearing each incoming user's
credentials with its own DELETE statement — N round-trips inside the
transaction. Match the pattern already used for the user prune and
issue a single DELETE FROM credentials WHERE username = ANY($1)
instead.

* refactor(s3api/iam): plumb context through UpdateUser

UpdateUser was synthesizing a fresh context.Background() inside the
inline-policy migration block, which discards the request deadline,
cancellation, and tracing carried by the caller. Add ctx as the first
parameter and pass r.Context() in via the ExecuteAction dispatcher,
mirroring the signature already used by CreatePolicy /
AttachUserPolicy / DetachUserPolicy.

* fix(util/pgxutil): quote DSN values per libpq rules

BuildDSN was concatenating values directly, so any password / cert path
/ database name with a space, single quote, or backslash produced a
malformed connection string and pgx.ParseConfig either errored or
mis-parsed the remainder. Critical now that the helper is shared with
the credential store: mTLS deployments routinely sourcing passwords or
secret-mounted cert paths from a vault are exactly the case where
spaces and quotes show up.

Add quoteDSNValue: empty values and values containing whitespace, `'`,
or `\` are wrapped in single quotes with `'` and `\` escaped per
PostgreSQL libpq rules; plain alphanumeric values pass through
unchanged. Apply it to every variable field in BuildDSN.

Adds a test that round-trips a password containing spaces, quotes and
backslashes through pgx.ParseConfig and confirms the parsed Config
matches the input.

* fix(credential,s3api/iam): atomic UserRenamer to avoid FK violation on rename

The previous IAM rename path called PutUserInlinePolicy(newName, ...)
before SaveConfiguration created the new users row. user_inline_policies
has a non-deferrable FOREIGN KEY (username) REFERENCES users(username),
which Postgres validates at statement time, so every rename of a user
that owned at least one inline policy failed with an FK violation. The
existing memory-store regression test missed it because the memory
backend has no FK enforcement.

Add an optional credential.UserRenamer interface plus a
CredentialManager.RenameUser thin shim that returns (supported, err).

Implement it on PostgresStore as an atomic in-transaction migration:
INSERT the new users row by SELECT-copying from the old, UPDATE
credentials.username and user_inline_policies.username to the new
name (FK satisfied because both rows now exist), then DELETE the old
row. ErrUserNotFound / ErrUserAlreadyExists are surfaced cleanly.

Implement it on MemoryStore by re-binding store.users / store.accessKeys
/ store.inlinePolicies under the new name. Also fixes a small leak in
DeleteUser, which was forgetting to drop the user's inline-policy
bucket.

EmbeddedIamApi.UpdateUser now calls RenameUser first; if the store
implements the interface, that's the whole migration. If it doesn't
(stores without FK enforcement), fall back to the previous
list / get / put copy.

Adds a focused test for MemoryStore.RenameUser that asserts the
identity, the access-key index, and the inline policies all land
under the new name.
2026-04-26 16:31:53 -07:00
Chris Lu
4f628ff4e5 fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908) (#9228)
* fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908)

buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY
chunk upfront, then walked them with io.MultiReader. For a multipart
SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob
with 25+ chunks), N volume-server bodies sat live at once; chunks
1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent
load the volume server's keep-alive logic closed those idle responses
mid-flight, and the S3 client saw `unexpected EOF` partway through the
GET. Truncated bytes hash to the wrong SHA-256, which is exactly the
"Digest did not match" symptom Docker Registry reports in #8908 (and
which persisted even after the per-chunk metadata fix in #9211 and the
completion backfill in #9224).

Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk,
wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for
the SSE-specific decryption setup. Per-chunk metadata is still
validated UPFRONT so a malformed chunk fails fast without opening any
HTTP connection -- the eager validation contract callers and tests
rely on is preserved. The volume-server GET and the SSE-specific
decrypt wrap, however, fire LAZILY: at most one chunk body is live at
any time, regardless of object size.

This commit applies the new pattern to buildMultipartSSES3Reader only;
the SSE-KMS and SSE-C multipart readers retain their eager form for
now and will be migrated in follow-up commits, since the same shape
exists there too.

Tests:
  - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract:
    zero chunks opened at construction, peak liveness == 1, all closed
    after drain.
  - TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch
    (replaces ClosesAppendedOnError) asserts a malformed chunk in
    position N causes zero fetches for chunks 0..N -- the previous
    test pinned a weaker contract (cleanup after eager open).
  - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same
    reason: the fetch callback must NOT be invoked at all on a bad-IV
    chunk.
  - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts
    encrypted the way putToFiler writes them (shared DEK + baseIV,
    partOffset=0, post-completion global offsets) and walks them
    through buildMultipartSSES3Reader.

* fix(s3api): stream multipart SSE-KMS chunks lazily

Apply the same fix as the previous commit to
createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata
is validated upfront, but volume-server GETs fire lazily through
lazyMultipartChunkReader. At most one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The wire format on disk is unchanged (same per-chunk metadata, same
encrypted bytes, same object Extended attributes). Existing SSE-KMS
multipart objects read back identically -- only when the volume-server
GETs fire changes.

* fix(s3api): stream multipart SSE-C chunks lazily

Apply the same fix as the previous two commits to
createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is
validated upfront (IV decode, IV length check, non-negative
PartOffset), but the volume-server GET and CreateSSECDecryptedReader-
WithOffset wrap fire lazily through lazyMultipartChunkReader. At most
one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-C multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The pre-existing TODO note about CopyObject SSE-C PartOffset handling
is preserved verbatim. The wire format on disk is unchanged (same
per-chunk metadata, same encrypted bytes); existing SSE-C multipart
objects read back identically.

After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS,
SSE-C) share lazyMultipartChunkReader as their streaming engine.

* test(s3): add Docker Registry-shape multipart SSE-S3 GET regression

Pin the end-to-end fix for #8908 with a test that mirrors what Docker
Registry actually does on pull: a 25-part * 5MB upload with bucket-
default SSE-S3, then a full GET, then SHA-256 over the streamed body
must match SHA-256 over the uploaded bytes.

The eager-multipart-reader bug was specifically a streaming truncation
under load: the response status was 200 with a Content-Length matching
the object size, but the body short-circuited mid-stream because
later chunks' volume-server connections had already been closed by
keepalive. The hash check is the symptom Docker Registry surfaces
("Digest did not match"), so this is the most faithful regression we
can pin without spinning up a registry.

uploadAndVerifyMultipartSSEObject already byte-compares the GET body,
but hashing on top is intentionally explicit -- it documents WHY the
test exists, and matches the failure mode reported in the issue.

* test(s3): add range-read coverage matrix across SSE modes and sizes

Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB)
single-chunk objects, with one ad-hoc range case per SSE mode and one
129-byte boundary-crossing case in TestSSEMultipartUploadIntegration.
Nothing exercised:

  - Range reads on single-PUT objects whose content crosses the 8MB
    internal chunk boundary (medium size class).
  - Range reads on multipart objects whose parts each span multiple
    internal chunks (large size class) -- the shape #8908 originally
    surfaced for full-object GETs and the most likely site of any
    future regression in per-chunk IV / PartOffset plumbing for
    partial reads.
  - A consistent range-pattern set applied uniformly across SSE modes,
    so any divergence between modes (SSE-C uses random IV + PartOffset;
    SSE-S3/KMS use base IV + offset) is comparable at a glance.

TestSSERangeReadCoverageMatrix introduces a parameterized matrix:

  modes:     no_sse, sse_c, sse_kms, sse_s3
  sizes:     small (256KB single chunk),
             medium (12MB single PUT crossing one internal boundary),
             large (5x9MB multipart, ~10 internal chunks, every part
                    itself spans an 8MB boundary)
  ranges:    single byte at 0, prefix 512B, single byte at last,
             suffix bytes=-100, open-ended bytes=N-, whole object,
             AES-block boundary 15-31, mid straddling one internal
             boundary (medium+large), mid spanning many internal
             boundaries (large only)

Per case it asserts: body bytes equal the expected slice, Content-Length
matches the range length, Content-Range matches start-end/total, and the
SSE response headers match the mode.

The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's
the remaining sse_kms subtests with a clear reason if the local server
has no KMS provider configured -- the default `weed mini` setup lacks
one; the Makefile target `test-with-kms` provides one via OpenBao. Other
modes always run.

Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against
weed mini, sse_kms cleanly skipped.

* test(s3): conform new test names to TestSSE*Integration so CI runs them

The two tests added in the previous commits had names that did NOT match
the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml
use to discover SSE integration tests:

  - test/s3/sse/Makefile `test` target:           TestSSE.*Integration
  - test/s3/sse/Makefile `test-multipart`:        TestSSEMultipartUploadIntegration
  - .github/workflows/s3-sse-tests.yml:           ...|.*Multipart.*Integration|.*RangeRequestsServerBehavior

Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and
the Docker-Registry-shape multipart regression in
TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible
to CI even though the underlying test setup (start-seaweedfs-ci using
s3-config-template.json with the embedded `local` KMS provider) already
has SSE-KMS configured.

Renames:

  TestSSERangeReadCoverageMatrix              -> TestSSERangeReadIntegration
  TestSSES3MultipartManyChunks_...            -> TestSSEMultipartManyChunksIntegration

Both names now match `TestSSE.*Integration` (Makefile `test` target) and
TestSSEMultipartManyChunksIntegration additionally matches
`.*Multipart.*Integration` (CI's comprehensive subset). No behavior
change; only the function names move.

Verified locally against `weed mini` with s3-config-template.json:
TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes
(none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns,
all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only
fires for ad-hoc local setups that don't load any KMS provider; the
project's standard test setup loads the local provider, so CI has full
SSE-KMS range coverage.

* fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch

Addresses CodeRabbit review on PR #9228: in
createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata
was deserialized in the prep loop but the IV length was only validated
later, inside CreateSSEKMSDecryptedReader, which runs from the wrap
closure -- AFTER the chunk's volume-server fetch has already started.
That weakens the new "reject malformed chunks before any fetch" contract
for SSE-KMS specifically: a chunk with a missing/short/long IV would
fire its HTTP GET, then fail mid-stream during decrypt.

The fix moves the existing ValidateIV check into the prep loop, matching
the SSE-S3 and SSE-C paths.

Drive-by: extract the SSE-KMS prep loop into a free
buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader,
so the new contract is unit-testable without an S3ApiServer. The
exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a
thin caller, so behavior for production callers is unchanged.

New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract:

  - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers
    missing IV, empty IV, short IV, long IV. Each case asserts both
    that an error is returned AND that the fetch callback is never
    invoked.
  - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch
    pins the analogous behavior when SseMetadata is nil on a chunk in
    position N: chunks 0..N-1 must not be fetched (the earlier eager
    implementation depended on a closeAppendedReaders cleanup path; the
    new contract is stronger -- nothing is opened in the first place).
  - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch
    covers the JSON-unmarshal failure branch.
  - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the
    documented sort-by-offset contract by recording the order in which
    fetch is invoked.

All four pass under `go test ./weed/s3api/`. Existing weed/s3api unit
suite + the SSE integration suite (with the local KMS provider enabled
via s3-config-template.json) continue to pass.

* test(s3): address CodeRabbit nitpicks on range coverage matrix

Three small follow-ups on the range-read coverage matrix from the
previous commit, per CodeRabbit nitpicks on PR #9228:

1. Promote the body-length check from `assert.Equal` to `require.Equal`
   so a truncation regression -- the canonical #8908 failure mode --
   aborts the subtest immediately. Previously the assertion logged a
   length mismatch and then `assertDataEqual` ran on differently-sized
   slices, producing a noisy byte-diff on top of the actual symptom.
   The redundant trailing `t.Fatalf` block becomes dead and is removed.

2. Broaden the SSE-KMS probe-skip heuristic. The probe previously
   produced the friendly "KMS provider not configured" message only
   for 5xx responses; KMS-misconfig surfaces also include 501
   NotImplemented, 4xx KMS.NotConfigured, and error messages
   containing "KMS.NotConfigured" / "NotImplemented" /
   "not configured". The behaviour change is purely cosmetic (the
   caller t.Skip's on any non-empty reason either way) but the new
   diagnostic is more useful in CI logs.

3. Add `t.Parallel()` at the mode and size-class levels of the matrix.
   Each (mode, size) writes an independent object key under the shared
   bucket, with no cross-talk, so parallel execution is safe. Local
   wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%);
   the savings scale with chunk count and CI machine concurrency.

Verified locally against `weed mini` with s3-config-template.json:
  - go test ./weed/s3api/ -count=1                   PASS
  - TestSSERangeReadIntegration -v                   112 PASS, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.           PASS

* fix(s3api): tighten lazy reader error path; unify SSE IV validation

Three CodeRabbit nitpicks on PR #9228:

1. lazyMultipartChunkReader: mark finished on non-EOF Read errors

   The Read loop's three earlier failure paths (chunk index past end,
   fetch error, wrap error) all set l.finished = true before returning.
   The non-EOF Read path -- where l.current.Read itself errors mid-chunk
   -- did not, leaving l.current/l.closer set and l.finished = false. A
   caller that retried Read after an error would re-enter the same
   broken stream instead of advancing or giving up. Set l.finished =
   true on non-EOF Read error so post-error state is consistent across
   all four failure sites; Close() (which the GetObjectHandler defers)
   still releases the chunk body.

2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths

   The previous commit moved SSE-KMS to the shared ValidateIV helper
   but left SSE-S3 and SSE-C with bespoke inline `len(...) !=
   AESBlockSize` checks. All three are enforcing the same invariant;
   inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to
   ValidateIV too, with the same `<algo> chunk <fileId> IV` name
   convention. Error message wording shifts from "<algo> chunk X has
   invalid IV length N (expected 16)" to ValidateIV's "invalid <algo>
   chunk X IV length: expected 16 bytes, got N". The substring
   "IV length" is preserved across both, so the existing
   TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion
   is loosened to match either form.

3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering

   The test previously drove Read() to observe fetch-call order, but
   CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap
   the encrypted DEK -- unavailable in unit tests -- so the wrap
   closure failed on the first chunk and only one fetch was ever
   recorded. The test asserted only fetchOrder[0] == "c0", which is
   weaker than the comment promised.

   Switch to a static check: type-assert the returned reader to
   *lazyMultipartChunkReader (same package so unexported fields are
   accessible) and inspect the prepared chunks slice directly. This
   pins the entire [c0, c1, c2] sort order in one place, doesn't
   depend on KMS, and runs in zero fetch calls. The fetch closure
   now asserts it is never invoked during preparation.

All weed/s3api unit tests pass; integration suite (with KMS provider
configured via s3-config-template.json) passes.

* test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe

Two CodeRabbit comments on PR #9228, both about
test/s3/sse/s3_sse_range_coverage_test.go:

1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration

   The test creates one bucket up front, then runs subtests that call
   t.Parallel() at the mode and size levels (added in 058cbf27 to cut
   wall time). t.Parallel() pauses each subtest and yields back to the
   parent. The parent's for loop finishes scheduling, the function
   returns, and the deferred cleanupTestBucket fires -- BEFORE any
   parallel subtest body has executed. The bucket gets deleted out
   from under the parallel subtests, which then race the cleanup and
   either fail with NoSuchBucket or, depending on lazy-deletion
   behaviour on the server side, mask other regressions because
   chunks happen to still be readable for a brief window.

   The local matrix passing prior to this commit was a server-side
   coincidence; the t.Cleanup contract is the right one for parent
   tests with parallel children, and switching to it is a one-line
   change. t.Cleanup runs after the test AND all its (parallel)
   subtests complete, so the bucket survives until every leaf
   subtest is done.

2. MINOR: tighten the SSE-KMS probe-skip heuristic

   The previous broadening (058cbf27) treated `code == 400` as
   "KMS provider not configured", on the theory that some servers
   return 4xx for KMS misconfig. That is too aggressive: a real
   misconfiguration in the SSE-KMS test request itself (bad keyID
   format, missing header) ALSO surfaces as a 400, and would
   silently t.Skip the SSE-KMS subtree in CI -- which is exactly
   the integration coverage the new TestSSERangeReadIntegration is
   supposed to add. Drop the 400 branch (and the redundant 501
   match, since 501 >= 500 already covers it). Genuine
   "KMS.NotConfigured" / "NotImplemented" responses are still
   recognised via the string-match block immediately below,
   regardless of status code, so the friendly skip message survives
   for the cases where it actually applies.

Verified locally against `weed mini` with s3-config-template.json:

  - go test ./weed/s3api/                     PASS
  - TestSSERangeReadIntegration -v            113 PASS lines, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.    PASS
2026-04-26 16:31:42 -07:00
Chris Lu
525900dfe4 fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224)
* fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion

When a part of an SSE-S3 multipart upload lands with SseType=NONE on
its chunks (e.g. a transient failure to apply SSE-S3 setup in
PutObjectPart), the completed object inherits NONE-tagged chunks and
detectPrimarySSEType then misses the chunked SSE-S3 encryption. The
read path falls through to the unencrypted serve and GET returns
ciphertext, producing the SHA mismatch reported in #8908.

Recover at completion using the base IV and key data the upload
directory recorded at CreateMultipartUpload:

  - extractMultipartSSES3Info validates upload-entry metadata up
    front and hard-fails completion if the base IV or key data are
    malformed; serializing chunk metadata we then could not decrypt
    is worse than rejecting the upload.
  - completedMultipartChunk re-derives a per-chunk IV from baseIV +
    chunk.Offset (matching what putToFiler would have written) and
    serializes per-chunk SSE-S3 metadata when the chunk has no tag.
    Existing per-chunk metadata is left alone; we cannot recover an
    already-derived IV from the upload-entry alone.

The IV formula intentionally has no partNumber term: putToFiler
hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption
for every part, so each chunk's encryption IV is
calculateIVWithOffset(baseIV, chunk.Offset_part_local).
PartOffsetMultiplier is defined in s3_constants but is not consumed
by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier
+ chunk.Offset would produce IVs that fail to decrypt the bytes on
disk - a stronger failure mode than the bug being fixed. Tests pin
this:

  - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext
    runs the round trip across the encryption boundary: encrypt
    parts with CreateSSES3EncryptedReaderWithBaseIV (the call
    putToFiler uses), drop chunk metadata to reproduce #8908,
    backfill, decrypt with backfilled IV, assert plaintext intact.
  - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula
    constructs the IV the partNumber formula would produce and
    shows it does not decrypt the actual ciphertext.

This commit covers the chunk-level recovery only. The companion
fix for the object-level Extended attributes (SeaweedFSSSES3Key /
X-Amz-Server-Side-Encryption) follows separately.

* fix(s3api): backfill canonical SSE-S3 attributes onto multipart object

The previous commit ensures every chunk of an SSE-S3 multipart upload
carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct
read path can decrypt. The completed object's Extended map can still
miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal
look at:

  - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header
    detectPrimarySSEType reads on inline / small-object reads)
  - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by
    IsSSES3EncryptedInternal and by the read-path key lookup)

When a part of the upload was written by a path that did not set
those (the same #8908 race that produced the NONE chunks),
copySSEHeadersFromFirstPart finds nothing to copy and the final entry
ends up with only the multipart-init keys (SeaweedFSSSES3Encryption /
BaseIV / KeyData). The read path then mis-detects the object as
unencrypted.

applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair
from the multipart-init metadata in all three completion paths
(versioned, suspended, non-versioned), only when the keys are missing
so a healthy first part still wins. extractMultipartSSES3Info already
ran in prepareMultipartCompletionState, so the data is reused without
re-decoding.

Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill,
do-not-clobber, and nil-info no-op cases.

* fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt

decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV
(calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the
adjusted IV to CreateSSEKMSDecryptedReader, which itself runs
calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The
offset was being applied twice for any chunk with a non-zero
ChunkOffset, corrupting the keystream for range reads that cross
multipart chunk boundaries.

Pass the raw SSE-KMS key (with base IV and the original ChunkOffset
field) into CreateSSEKMSDecryptedReader so the offset is applied
exactly once, and remove the now-dead intra-block skip that was
compensating for the double adjustment.

Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment
that decrypts the same ciphertext with a deliberately double-adjusted
IV and asserts the output is corrupted, so any regression that
re-introduces the double application fails the unit test.

* test(s3): cover multipart SSE across chunk-spanning parts and ranges

Adds an integration subtest "Multipart Parts Larger Than Internal
Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that
exercises the end-to-end S3 path for the bugs fixed in this branch:

  - Two-part multipart upload with each part larger than the 8MB
    internal SeaweedFS chunk, so each part itself spans multiple
    underlying chunks.
  - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default
    SSE-S3 - the four paths multipart parts can take through the SSE
    pipeline.
  - Each subtest does a full GET (verifying every byte and the
    response Content-Length / SSE response headers) plus a 129-byte
    range read straddling the 8MB internal chunk boundary, which is
    the path that produced the SSE-KMS double-IV corruption (fix in
    the previous commit) and the SSE-S3 chunk-tag loss (fix in the
    earlier commits).

Factored the request shape behind multipartSSEOptions /
uploadAndVerifyMultipartSSEObject so all four SSE flavors share the
same upload+verify code; only the SSE-specific input/output
configuration differs per subtest.

* test(s3): abort orphan multipart uploads on test failure

Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The
helper used require.NoError after CreateMultipartUpload, UploadPart
and CompleteMultipartUpload, so a failure in any of those (or in the
later GET / range read on a still-incomplete upload) called t.Fatal
without aborting the in-flight MPU, leaving an orphan upload in the
bucket. Harmless in CI where the data dir is wiped on shutdown, but a
real annoyance when iterating locally and a textbook AWS S3 caveat in
production.

Register a t.Cleanup that calls AbortMultipartUpload unless a
"completed" flag was set right after a successful
CompleteMultipartUpload. Use context.Background for the abort call
since the parent ctx may already be cancelled at cleanup time, and
t.Logf the abort error rather than failing the test so the original
failure remains visible in the run output.
2026-04-25 23:06:37 -07:00
Chris Lu
d65c568cbb fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests (#9218)
* fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests

DeserializeSSES3Metadata does not require an IV, and a corrupted or
legacy chunk without one would have flowed into cipher.NewCTR and
panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes
before decryption, closing the current and any already-appended chunk
readers on error.

Factor the per-chunk decryption loop out of
createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader
so it can be driven with a mock chunk fetcher, and add tests covering:
the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order
chunks) to lock in the fix from #9211; missing-IV and short-IV metadata
rejection without panic; and reader cleanup when a later chunk fails.

* address review: sort chunks copy; close encryptedStream on error

- buildMultipartSSES3Reader now sorts a copy of the chunks slice so
  callers do not observe entry.Chunks reordered (other code paths,
  e.g. ETag computation, can rely on the original order).
- createMultipartSSES3DecryptedReaderDirect now closes encryptedStream
  on the error path from buildMultipartSSES3Reader. All current
  callers pass nil, but this keeps cleanup symmetric with the
  success path.
- Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the
  input slice is not mutated.

* address review: defer single close; extend chunk-copy + IV-guard pattern

- createMultipartSSES3DecryptedReaderDirect: collapse the duplicated
  encryptedStream.Close() calls into a single nil-guarded defer so the
  error and success paths share cleanup.
- createMultipartSSECDecryptedReaderDirect,
  createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks
  instead of mutating the caller's slice, matching the SSE-S3 helper.
- createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length
  before handing it to cipher.NewCTR; a base64-decoded empty or short
  IV from malformed/corrupt metadata would otherwise panic.
- SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls
  ValidateIV before cipher.NewCTR. Note recorded in the sort comment.

* address review: close appended readers on SSE-C/SSE-KMS error paths

createMultipartSSECDecryptedReaderDirect and
createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk
reader on error and leaked any chunk readers already appended to the
local readers slice, mirroring the leak previously fixed in the SSE-S3
helper. Add the same closeAppendedReaders() closure pattern to both
functions and invoke it on every error return inside the loop so failed
requests do not leak volume-server HTTP connections.

* address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment

- Move encryptedStream.Close() to a nil-guarded defer at the top of
  createMultipartSSECDecryptedReaderDirect and
  createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on
  every return path (including error returns from inside the per-chunk
  loop), mirroring the SSE-S3 helper.
- In buildMultipartSSES3Reader, iterate sortedChunks directly instead of
  reassigning chunks = sortedChunks.
2026-04-24 13:59:23 -07:00
os-pradipbabar
8815844278 fix(s3api): correct SSE-S3 decryption key handling in multipart uploads (#9211)
* fix(s3api): correct SSE-S3 decryption key handling in multipart uploads

* fix(s3api): preallocate readers and close on error in SSE-S3 direct path

Address review feedback on createMultipartSSES3DecryptedReaderDirect:
preallocate the readers slice with the known chunk count, and close any
already-appended chunk readers on error returns so failed requests do
not leak volume-server HTTP connections.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-24 12:00:29 -07:00
Chris Lu
88c2f3c34d fix(iam): accept bare "*" resource in PutUserPolicy (#9209) (#9210)
AWS IAM treats a bare "*" in a statement's Resource as "any resource",
but the embedded IAM resource parser required a 6-segment S3 ARN and
silently skipped anything else. With a policy like
{Action: "s3:*", Resource: "*"}, every resource was dropped and the
statement produced no actions, so PutUserPolicy rejected the document
with "no valid actions found in policy document".

Short-circuit Resource == "*" to the same full-wildcard path that
"arn:aws:s3:::*" already takes.
2026-04-23 22:14:41 -07:00
Chris Lu
34b236acfa test(s3api): look up NewUser by name in CreateAccessKey collision test
The memory credential store backs LoadConfiguration with a map, so the
identity order is not stable across a save/load round trip. Indexing
Identities[1] intermittently pointed at the owner identity and produced
a spurious credential leak.
2026-04-23 11:09:17 -07:00
Jon E Nesvold
c6302fcb54 feat(iam): allow caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey (#9172)
* feat(iam): support caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey

Both IAM implementations (standalone and embedded) now check for
caller-supplied AccessKeyId and SecretAccessKey form parameters before
generating random credentials. If provided, the caller-supplied values
are used. If empty, random keys are generated as before.

This enables programmatic identity provisioning where the caller needs
to control the S3 credentials.

Backward-compatible: no behavior change for callers that omit these
parameters.

* refactor(iam): extract shared caller-supplied credential validation

Move the AccessKeyId/SecretAccessKey format checks and the in-memory
collision scan into weed/iam so the standalone IAM API, the embedded
IAM in s3api, and the admin dashboard all enforce the same rules.

- ValidateCallerSuppliedAccessKeyId: 4-128 alphanumeric (rejects
  SigV4-breaking characters like '/' and '=').
- ValidateCallerSuppliedSecretAccessKey: 8-128 chars.
- FindAccessKeyOwner: scans identities and service accounts and
  returns the owning entity type + name for debug logging, without
  exposing the owner in caller-facing error messages.

The admin dashboard previously only length-checked caller-supplied
keys; it now enforces the same alphanumeric rule, which matches what
SigV4 actually accepts anyway.

* fix(iam): reject partial caller-supplied AccessKeyId/SecretAccessKey

Previously, if a caller supplied only one of AccessKeyId or
SecretAccessKey, CreateAccessKey logged a warning and auto-generated
the missing half. That silently returns a credential the caller did
not fully choose, which is surprising and easy to miss in a response
they expected to echo back their input.

Return ErrCodeInvalidInputException instead: either both are supplied
or neither is. Updates the mixed-supply tests in weed/iamapi and
weed/s3api to assert the rejection.

* chore(iam): centralize and broaden sensitive form redaction

DoActions and ExecuteAction both had an inline loop that redacted
SecretAccessKey from their debug-level request log. Replace the two
copies with iam.RedactSensitiveFormValues, backed by an explicit
sensitive-keys set.

The set now also covers Password, OldPassword, NewPassword,
PrivateKey, and SessionToken. None of those parameters are used by
today's IAM actions, but naming them here makes the log-safety
guarantee survive future additions such as LoginProfile / STS.

* test(iam): cover the upper length bound for CreateAccessKey

TestCreateAccessKeyBoundary / TestEmbeddedIamCreateAccessKeyBoundary
only exercised the 3/4-char lower edge. Add cases for 128 (accepted)
and 129 (rejected) for AccessKeyId, plus 7 / 128 / 129-char cases
for SecretAccessKey, so both ends of the validator are locked in at
the handler level (the pure validators in weed/iam already cover
this).

* fix(s3api/iam): verify user existence before RNG and collision scan

In the embedded IAM CreateAccessKey, the user lookup ran last: a
request for a non-existent user still walked the whole identity /
service-account list for collisions and, if no caller-supplied keys
were present, generated fresh random credentials with crypto/rand
before the NoSuchEntity error finally surfaced.

Reorder: validate inputs, then find the target identity, then do the
collision scan, then generate keys. A missing user now fails fast and
consumes no entropy, and the handler returns NoSuchEntity instead of
a misleading EntityAlreadyExists when both the user is missing and
the supplied AccessKeyId happens to collide with another identity's
key.

Add TestEmbeddedIamCreateAccessKeyRejectsMissingUser to lock in the
"no mutation on unknown user" guarantee.

The standalone iamapi CreateAccessKey intentionally keeps its
pre-existing "create-or-attach" semantics where a missing user is
implicitly provisioned — that is a behavior change beyond the scope
of this PR.

* test(iam): tighten collision leak assertion and cover 8-char secret

- Rename the collision-owner identity in TestCreateAccessKeyRejectsCollision
  (both iamapi and the embedded s3api test) from "existing" / "ExistingUser"
  to "ownerAlpha". The old assert.NotContains check was effectively a no-op
  because the error message never contained those substrings; a distinctive
  name shared with no part of the expected error body makes the leak guard
  actually meaningful if the wording ever drifts. The embedded test also
  adds a NotContains assertion that was previously missing entirely.
- Add an explicit 8-char SecretAccessKey pass case to both boundary tests
  so the lower edge of the validator is locked in at the handler level
  alongside the 7 / 128 / 129-char cases.

* fix(iamapi): enforce both-or-none before the collision lookup

In the standalone IAM CreateAccessKey, FindAccessKeyOwner ran before
the partial-credential check. If a caller supplied only AccessKeyId
and it happened to collide with an existing key, the response was
EntityAlreadyExists instead of the more fundamental InvalidInput for
omitting SecretAccessKey — wrong error class, and leaked the fact
that the probed key is already in use.

Swap the order: validate both-or-none first, then do the collision
scan. Matches the embedded IAM path and AWS behavior.

Add a case to TestCreateAccessKeyRejectsPartialSupply that combines
partial supply with a collision to lock in the ordering.

* fix(admin): reject partial caller-supplied AccessKey/SecretKey

The admin dashboard path silently generated the missing half when a
caller supplied only one of AccessKey or SecretKey, while the IAM
API and embedded IAM paths now reject this. Align the three: if
exactly one is provided, return ErrInvalidInput.

Also simplifies the generator block — either both are provided or
neither is, so there is no mixed path to handle.

* test(s3api/iam): guard dereferences in caller-supplied-keys test

TestEmbeddedIamCreateAccessKeyWithCallerSuppliedKeys dereferenced
*AccessKeyId/*SecretAccessKey/*UserName and indexed
Identities[0].Credentials[0] without first verifying shape, so any
future regression that returns a partial response or skips the
config mutation would panic mid-assertion instead of failing with
a clear message.

Add require.NotNil on the response pointers and require.Len on
the identities/credentials slices before the asserts.

* test(iamapi): exercise the service-account branch of the collision check

FindAccessKeyOwner scans both Identities[*].Credentials and
ServiceAccounts[*].Credential, but TestCreateAccessKeyRejectsCollision
only covered the identity branch. Split the test into two subtests —
one per branch — so a future refactor that drops the service-account
scan (or mutates the existing credential) trips a failure.

Also asserts the existing service-account credential is not mutated
and no credential is attached to the target identity on rejection.

* test(iam): isolate 129-char secret subcase from prior credential

In both TestCreateAccessKeyBoundary (iamapi) and
TestEmbeddedIamCreateAccessKeyBoundary (s3api), the 129-char
SecretAccessKey subcase reused the "validkey" AccessKeyId that the
preceding 8-char subcase had just persisted into the config. The
test still asserted the right outcome because the handler validates
secret length before running the collision scan — but if the two
checks ever swap, the subcase would pass (or fail) for the wrong
reason.

Reset the in-memory credentials before the 129-char subcase, matching
the pattern already used by the 3/128/129-char AccessKeyId and
7-char secret subcases. No behavior change; purely test isolation.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-22 12:35:55 -07:00
Chris Lu
7364f148bd fix(s3/shell): factor EC volumes into bucket size metrics and collection.list (#9182)
* fix(s3/shell): include EC volumes in bucket size metrics and collection.list

S3 bucket size metrics exported to Prometheus (and fed through
stats.UpdateBucketSizeMetrics) are computed by
collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos.
As soon as a volume was encoded to EC it dropped out of every aggregate,
so Grafana showed bucket sizes shrinking while physical disk usage kept
climbing. The shell helper collectCollectionInfo — used by collection.list
and s3.bucket.quota.enforce — had the same gap, with the EC branch left as
a commented-out TODO.

Fold EC shards into both paths using the same approach the admin dashboard
already uses (PR #9093):

- PhysicalSize / Size sum across shard holders: EC shards are node-local
  (not replicas), so per-node TotalSize() and MinusParityShards().TotalSize()
  sum to the whole-volume physical and logical sizes respectively.
- FileCount is deduped via max across reporters (every shard holder reports
  the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and
  must not pin the aggregate).
- DeleteCount is summed (each delete tombstones exactly one node's .ecj).
- VolumeCount increments once per unique EC volume id.

Adds regression tests covering pure-EC, mixed regular+EC, and the
slow-reporter FileCount dedupe case.

Refs #9086

* Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe

- Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding
  package that walk the shard bitmap directly instead of materializing a
  ShardsInfo and copying it via MinusParityShards(). Keeps the
  DataShardsCount dependency encapsulated in one place and avoids the
  per-shard allocation/copy overhead in the metrics hot path.
- Switch shell collectCollectionInfo ecVolumes map to a composite
  {collection, volumeId} key, matching the bucket_size_metrics collector
  and defending against any cross-collection volume id aliasing.
- Dedupe VolumeCount in shell addToCollection by volume id so regular
  volumes aren't counted once per replica presence. Aligns the shell's
  collection.list output with the S3 metrics collector and the EC branch,
  all of which now report logical volume counts.
- Add unit tests for the new helpers and for the regular-volume
  VolumeCount dedupe.

* Parameterize EcShardsDataSize with dataShards for custom EC ratios

Add a dataShards parameter to EcShardsDataSize so forks with per-volume
ratio metadata (e.g. the enterprise data_shards field carried on an
extended VolumeEcShardInformationMessage) can pass the configured value
and get accurate logical sizes under custom EC policies like 6+3 or 16+6.
Passing 0 or a negative value falls back to the upstream DataShardsCount
default, which is correct for the fixed 10+4 layout — so OSS callers in
s3api and shell pass 0 and keep their current behavior.

Added table cases covering the custom 6+3 and 16+6 paths so the
parameterization is pinned by tests.
2026-04-21 20:17:42 -07:00
Chris Lu
e77f8ae204 fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157)

The STS GetFederationToken handler was implemented but never reachable.
Three routing gaps sent requests to the S3/IAM path instead of STS:

- No explicit mux route for Action=GetFederationToken in the URL query
- iamMatcher did not exclude GetFederationToken, so authenticated POSTs
  with Action in the form body were matched and dispatched to IAM
- UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity
  to STS, leaving GetFederationToken to fall through to DoActions and
  return NotImplemented

Add the missing route, the matcher exclusion, and the dispatch branch.

Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount
into the s3-iam-tests workflow as a new "sts" matrix entry. Before this
change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four
test functions ran in CI, which is why this regression shipped.

* test(iam): make orphaned STS/service-account tests pass under auth-enabled CI

Follow-up to wiring STS tests into CI: fixes several pre-existing issues
that made the newly-included tests fail locally.

Server fixes:
- weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when
  the caller is a legacy S3-config identity (not in the IAM user store).
  Previously any GetPoliciesForUser error short-circuited to InternalError,
  which hard-failed every SigV4 caller using keys from -s3.config.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs
  in the sa:<parent>:<uuid> format required by
  credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed
  the persistence-layer regex and caused every CreateServiceAccount call
  to return 500 once a filer-backed credential store validated the ID.

Test helpers:
- test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer
  sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from
  req.URL.Host, and a manual Host header made the signer emit host;host in
  SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn
  subtest to match the existing SeaweedFS behavior (user-context
  assumption).
- test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs
  requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned
  IAM writes otherwise fall through to the STS fallback and return
  InvalidAction.

CI matrix:
- .github/workflows/s3-iam-tests.yml: skip
  TestServiceAccountLifecycle/use_service_account_credentials only. The
  rest of the service-account suite passes; that one subtest depends on a
  separate credential-reload issue where new ABIA keys briefly register
  into accessKeyIdent but aren't persisted to the filer, so they vanish
  on the next reload. Out of scope for the #9157 GetFederationToken fix.

* fix(credential): accept AWS IAM username chars in service-account IDs

Gemini review on #9167 pointed out that ServiceAccountIdPattern's
parent-user segment was more restrictive than an AWS IAM username:
`[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with
`@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail
validation at the filer store even though the embedded IAM API
happily created them.

Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec
at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html)
and add a table-driven test that locks the expansion in.

* address PR review feedback on #9167

All five review items were valid; changes keyed to review bullets:

- weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows
  arbitrary policy-lookup failures. Only credential.ErrUserNotFound is
  treated leniently (the legacy-config SigV4 path); any other error now
  returns InternalError so we don't mint tokens with an incomplete
  policy set.
- weed/credential/grpc/grpc_identity.go: GetUser translates gRPC
  NotFound back to credential.ErrUserNotFound so errors.Is(...) above
  matches for gRPC-backed stores, not just memory/filer-direct.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates
  the generated saId against credential.ValidateServiceAccountId before
  returning. Surfaces a client 400 with the offending ID instead of the
  opaque 500 that used to bubble up from the persistence layer.
- weed/s3api/s3api_server_routing_test.go: seed a routing-test identity
  with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody
  with aws-sdk-go v4.Signer so the request actually passes
  AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers
  with no stsService) instead of just NotEqual(501) — 503 proves the
  dispatch reached STSHandlers.HandleSTSRequest.
- test/s3/iam/s3_service_account_test.go: callIAMAPI signs with
  service="iam" instead of "s3" (SeaweedFS verifies against whichever
  service the client signed with, but "iam" is semantically correct).
- weed/credential/validation_test.go: add positive rows for an
  uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID
  suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000).
2026-04-20 19:33:22 -07:00
Lars Lehtonen
93d604d799 chore(weed/s3api/policy): prune unused test functions (#9150) 2026-04-20 12:04:41 -07:00
Chris Lu
0315da9022 fix(s3api): self-heal stale .versions latest-version pointer on read (#9125)
* fix(s3api): self-heal stale .versions latest-version pointer on read

When the `.versions` directory metadata points at a version file that has
gone missing (e.g. a crash between deleting the latest version and
rewriting the pointer, or a concurrent delete racing with a read),
`getLatestObjectVersion` bailed with a hard error that required manual
repair. Tagging, ACL, retention, copy-source, and HEAD/GET all surfaced
NoSuchKey even though other versions remained on disk.

On `ErrNotFound`/`codes.NotFound` from the pointed-at version lookup,
rescan the `.versions` directory, pick the newest remaining non-delete-
marker entry, persist the repaired pointer (best-effort), and return
that entry. If only delete markers (or nothing) remain, the caller still
sees an error and the object correctly appears absent.

Extracted the selection logic into a pure `selectLatestContentVersion`
helper so `updateLatestVersionAfterDeletion` and the new self-heal path
share a single implementation. A warning is logged whenever the heal
kicks in so stale-pointer incidents remain visible in operator logs.

* fix(s3api): self-heal must promote newest version even if delete marker

Review feedback (gemini-code-assist): the self-heal path used
`selectLatestContentVersion`, which skips delete markers. That had two
bugs:

1. If the chronologically newest entry was a delete marker, an older
   content version would be promoted, effectively "undeleting" an
   object that was actually deleted.
2. If only delete markers remained, heal returned an error and the
   caller surfaced a hard 500 instead of the correct 404-with-
   x-amz-delete-marker response.

Add `selectLatestVersion` that picks the newest entry regardless of
type (content or delete marker) and use it in
`healStaleLatestVersionPointer`. The promoted entry flows back through
`doGetLatestObjectVersion` unchanged; downstream handlers already
detect `ExtDeleteMarkerKey` on the returned entry and render
NoSuchKey + `x-amz-delete-marker: true` (see
`s3api_object_handlers.go:722-728`).

Kept `selectLatestContentVersion` in place for
`updateLatestVersionAfterDeletion`, which deliberately limits the
pointer to a live content version in the post-deletion flow — changing
that is out of scope for this fix.

Added four tests for the new selector:
- promotes newest delete marker over older content (the reviewer case)
- picks content when content is newest
- promotes newest delete marker when only delete markers remain
- returns nil latestEntry on empty/untagged input

* fix(s3api): paginate .versions scan in self-heal path

Review feedback (gemini-code-assist, coderabbitai): the self-heal rescan
did a single-shot list(..., 1000). For objects whose version ids use the
old raw-timestamp format, filer ordering is lexicographic-ascending =
oldest-first. If the .versions directory held more than one page of
entries, the first page contained only the oldest, and the heal would
promote an older version as the new "latest" — silently surfacing stale
data on subsequent reads.

Paginate through the whole directory with `filer.PaginationSize`,
running `selectLatestVersion` per page and keeping a single running
best candidate across pages (via `compareVersionIds`). This mirrors the
pagination pattern already used by `getObjectVersionList` in the same
file and closes the window for old-format stacks larger than one page.

New-format (inverted-timestamp) ids were not affected because their
lexicographic order matches newest-first, but paginating is still the
right fix.

Also updated the function doc to reflect that self-heal now promotes
the newest entry regardless of type (content version or delete marker).

* fix(s3api): don't resurrect deleted objects; wrap ErrNotFound sentinel

Two review findings addressed:

1. `updateLatestVersionAfterDeletion` was using `selectLatestContentVersion`
   which skipped delete markers. Scenario: PUT v1, DELETE (dm1 written,
   pointer->dm1), PUT v3 (pointer->v3), user explicitly deletes v3 by
   versionId. Remaining files: v1, dm1. S3 semantics say the current
   version is the chronologically newest = dm1 (object appears deleted).
   Old code would promote v1, "undeleting" the object silently.

   Switched to `selectLatestVersion`, which picks the newest entry
   regardless of type. Also paginated the scan the same way the self-heal
   path does — otherwise a single-page `list(1000)` still mis-selects the
   latest for old-format version id stacks that exceed one page.

   Removed the `hasDeleteMarkers || !isLast` branch: with `selectLatestVersion`
   any delete marker participates in the comparison and shows up as the
   latest when it is newest, so the "keep the directory on ambiguity"
   guard becomes unreachable. Full pagination also makes the `!isLast`
   guard unnecessary — we either see every entry or surface a list error.

2. `healStaleLatestVersionPointer`'s "no remaining version" error was a
   plain `fmt.Errorf`, so callers could not distinguish genuine
   object-absence from a scan failure via `errors.Is`. Wrap it with
   `filer_pb.ErrNotFound` so the sentinel flows through the chain (the
   outer wrap already uses `%w`).

No test additions needed — `TestSelectLatestVersion_PromotesNewestDeleteMarker`
already asserts the "newer delete marker beats older content" invariant
that drives both fixes.

* refactor(s3api): drop unused selectLatestContentVersion after review

Review feedback flagged a stale comment claiming
selectLatestContentVersion "mirrors" the post-deletion semantics of
updateLatestVersionAfterDeletion. That claim became false when the
post-deletion path switched to selectLatestVersion in the previous
commit. Verified no production callers remain — only the helper's own
tests referenced it — so the cleaner fix is to delete the dead code
rather than rewrite the comment to explain "this exists but isn't
used."

- Removed selectLatestContentVersion.
- Removed the four TestSelectLatestContentVersion_* tests.
- Preserved a renamed TestSelectLatestVersion_MixedFormats so the
  mixed-format comparator coverage still runs against the active
  selector.
2026-04-17 14:57:59 -07:00
Chris Lu
bfea14320a fix(s3): reject unknown POST policy conditions and extra x-amz form fields (#9124)
* fix(s3): reject unknown POST policy conditions and extra x-amz form fields

CheckPostPolicy previously accepted policy conditions with unknown $keys
(e.g. "$foo") as satisfied, and only rejected stray X-Amz-Meta-* form
fields. Reject unknown condition keys outright, and extend the extra-
input-fields check to all X-Amz-* form fields except the reserved
auth/signing headers. Matches AWS S3 POST Object behavior.

* refactor(s3): drop redundant $x-amz-meta- prefix check in CheckPostPolicy

The $x-amz- prefix already subsumes $x-amz-meta-, so the explicit
$x-amz-meta- check adds no coverage. Simplify the else-if condition.
Addresses gemini-code-assist review on PR #9124.

* style(s3): align unknown-key policy error with [op, key, value] trailer

Reformat the unknown-condition-key error in CheckPostPolicy to include
the same "[op, key, value]" trailer used by the other condition-failed
messages. The value slot is empty because no comparison occurs for an
unknown key. The descriptive "unknown condition key" suffix is kept so
operators can still tell this failure from a mismatched value.

* fix(s3): honor starts-with prefix-stem POST policies when checking extras

AWS POST policies use ["starts-with","$x-amz-meta-",""] to allow any
X-Amz-Meta-* form field. The previous exact-match policyXAmzKeys would
flag every X-Amz-Meta-Foo as an "Extra input fields" failure because
only the stem X-Amz-Meta- was stored. Track starts-with conditions whose
key ends in "-" with an empty value as prefix stems, and accept any
X-Amz-* form field matching one of those stems.

* fix(s3): validate value prefix for starts-with POST policy stems

Drop the policy.Value == "" gate when detecting prefix-stem conditions
so that ["starts-with","$x-amz-meta-","pfx-"] is recognized as a prefix
rule. Track the required value prefix alongside the name prefix, enforce
it against every matching form field in the extras loop, and skip the
prefix-stem condition in the main iteration (it has no single form
field to evaluate). Also include policy.Value in the unknown-condition
error trailer for clearer debugging. Addresses gemini-code-assist
review on PR #9124.

* fix(s3): check every matching POST policy rule, not just the first

The extras loop exited early on exact-key match and broke on the first
matching prefix stem. Per AWS, a form field must satisfy every policy
condition that applies to it, so an exact-match field must still honor
any overlapping starts-with stem's value prefix, and multiple stems on
the same field must all hold. Drop both early exits: start matched from
the exact-key lookup, iterate all prefix stems, and fail on the first
value-prefix violation. Addresses gemini-code-assist review on PR
#9124.
2026-04-17 14:55:12 -07:00
Chris Lu
0bddc2652e fix(s3): propagate validated POST form fields to upload headers (#9123)
* fix(s3): propagate validated POST form fields to upload headers

POST Object form fields like acl, Content-Encoding, x-amz-storage-class,
x-amz-tagging, and x-amz-server-side-encryption were validated against
the POST policy but never forwarded to the underlying PUT, so the
validated values had no effect. Forward all non-reserved x-amz-* fields
plus acl (as x-amz-acl) and Content-Encoding. Reserved POST policy
mechanism fields (Policy, Signature, Key, etc.) are still excluded.

* fix(s3): also forward Content-Language from POST form to upload headers

AWS S3 POST Object supports Content-Language; add it to the set of
content headers forwarded by applyPostPolicyFormHeaders alongside
Cache-Control, Expires, Content-Disposition, and Content-Encoding.
Addresses gemini-code-assist review on PR #9123.

* refactor(s3): only look up form value in branches that use it

applyPostPolicyFormHeaders previously called formValues.Get(k) for every
form field, including fields that fall through the switch (non-reserved
fields that are neither Acl, a forwarded content header, nor X-Amz-*).
Move the lookup inside the switch cases that actually use it.
2026-04-17 14:55:06 -07:00
Chris Lu
2da24cc230 fix(s3): return 403 on POST policy violation instead of 307 redirect (#9122)
* fix(s3): return 403 on POST policy violation instead of 307 redirect

CheckPostPolicy failures previously responded with HTTP 307 Temporary
Redirect to the request URL, which causes clients to re-POST and
obscures the failure. Return 403 AccessDenied so the client surfaces
the error.

* test(s3): exercise PostPolicyBucketHandler end-to-end for 403 mapping

Replace the shallow ErrAccessDenied tautology test with one that builds
a signed POST multipart request whose policy conditions cannot be
satisfied, calls PostPolicyBucketHandler directly, and asserts HTTP 403
with no Location redirect header. Addresses gemini-code-assist review on
PR #9122.

* fix(s3): surface POST policy failure reason in AccessDenied response

Add s3err.WriteErrorResponseWithMessage so a caller can keep the
standard error code mapping while providing a specific Message. Use it
from PostPolicyBucketHandler so the XML body carries the CheckPostPolicy
error (e.g. which condition failed or that the policy expired) rather
than the generic "Access Denied." description. Addresses gemini-code-
assist review on PR #9122.

* refactor(s3err): delegate WriteErrorResponse to WriteErrorResponseWithMessage

The two helpers shared every line except the Message override. Fold
WriteErrorResponse into a one-line delegation that passes an empty
message, so the request-id/mux/apiError logic lives in exactly one
place. Addresses gemini-code-assist review on PR #9122.
2026-04-17 14:54:58 -07:00
Chris Lu
cce98fcecf fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly (#9120)
* fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly

AuthSignatureOnly is the only auth gate in front of S3Tables routes
(incl. CreateTableBucket) and UnifiedPostHandler, but unlike
authenticateRequestInternal it did not clear the internal IAM
trust headers before running signature verification. S3Tables
authorizeIAMAction reads X-SeaweedFS-Principal directly from the
request and prefers it over the authenticated identity's PrincipalArn,
so a signed low-privilege caller could append that header after signing
(unsigned header, SigV4 still verifies) and have IAM policy evaluated
against a spoofed principal, bypassing authorization.

Clear both X-SeaweedFS-Principal and X-SeaweedFS-Session-Token at the
top of AuthSignatureOnly, mirroring the existing guard in
authenticateRequestInternal. Add a regression test covering the
header-injection path.

* refactor(s3): route AuthSignatureOnly through authenticateRequestInternal

Addresses review feedback: both entry points were independently
maintaining the internal-IAM-header stripping and the auth-type dispatch
switch. Collapse AuthSignatureOnly into a thin wrapper around
authenticateRequestInternal so the security-critical header scrub and
the signature-verify switch live in one place. Post-auth behavior
unique to AuthSignatureOnly (AmzAccountId header) stays inline.

No functional change beyond two harmless telemetry tweaks that now
match authenticateRequestInternal: the per-branch glog verbosity shifts
from V(3) to V(4), and the anonymous-found path now sets AmzAuthType.

* refactor(s3): centralize X-SeaweedFS-Principal/Session-Token header names

Introduce SeaweedFSPrincipalHeader and SeaweedFSSessionTokenHeader in
weed/s3api/s3_constants so the trust-header literals are defined once and
referenced consistently by the auth scrub, JWT auth path, bucket policy
principal resolution, IAM authorization, and S3Tables IAM evaluation.
Replace every remaining usage in weed/s3api and weed/s3api/s3tables.
This removes the drift risk the reviewer called out: adding another call
site with a typo can no longer silently bypass the scrub.

Pure rename, no behavior change. No-op integration-test helper in
test/s3/iam/s3_iam_framework.go left untouched (separate module, and the
server now strips the client-supplied value regardless).
2026-04-17 12:23:21 -07:00
Chris Lu
88ac2d0431 security(s3api): reject unsigned x-amz-* headers in SigV4 requests (#9121)
A presigned URL holder could attach arbitrary x-amz-* headers to a PUT
request (e.g. x-amz-tagging, x-amz-acl, x-amz-storage-class,
x-amz-server-side-encryption*, x-amz-object-lock-*, x-amz-meta-*,
x-amz-website-redirect-location, x-amz-grant-*). Because only the
headers declared in SignedHeaders participate in signature
verification, the added headers bypass authentication; the PUT handler
then persists them into the object's Extended metadata.

Match the AWS SigV4 rule: every x-amz-* header present in the request
must appear in SignedHeaders. Exempt x-amz-content-sha256 (already
tamper-protected via the canonical request's payload-hash line) and,
for presigned URLs, the SigV4 protocol parameters that live in the
query string (X-Amz-Algorithm/Credential/Date/Expires/SignedHeaders/
Signature) in case they are duplicated as headers.

Applies to both header-based and presigned SigV4; non-amz headers are
unaffected.
2026-04-17 12:20:28 -07:00
Chris Lu
9554e259dd fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint (#9109)
* fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint

DuckDB ATTACH 's3://<bucket>/' AS cat (TYPE 'ICEBERG', ...) was failing
with "schema does not exist" because GET /v1/config ignored the warehouse
query parameter and returned no overrides.prefix, so subsequent requests
fell through to the hard-coded "warehouse" default bucket instead of the
one the client attached. LoadTable also returned an empty config, forcing
clients to discover the S3 endpoint out-of-band and producing 403s on
direct iceberg_scan calls.

- handleConfig now echoes overrides.prefix = bucket and defaults.warehouse
  when ?warehouse=s3://<bucket>/ is supplied.
- getBucketFromPrefix honors a warehouse query parameter as a fallback for
  clients that skip the /v1/config handshake.
- LoadTable responses advertise s3.endpoint and s3.path-style-access so
  clients can reach data files without separate configuration.

Refs #9103

* address review feedback on iceberg S3 endpoint vending

- deriveS3AdvertisedEndpoint is now a method on S3Options; honors
  externalUrl / S3_EXTERNAL_URL, switches to https when -s3.key.file is
  set, uses the https port when configured, and brackets IPv6 literals
  via util.JoinHostPort.
- handleCreateTable returns s.buildFileIOConfig() in both its staged and
  final LoadTableResult branches so create and load flows see the same
  FileIO hints.
- Add unit test coverage for the endpoint derivation scenarios.

* address CI and review feedback for #9109

- DuckDB integration test now runs under its own newOAuthTestEnv (with a
  valid IAM config) so the OAuth2 client_credentials flow DuckDB requires
  actually works; the shared env has no registered credentials, which was
  the cause of the CI failure. Helper createTableWithToken was added to
  create tables via Bearer auth.
- Tighten TestIssue9103_LoadTableDoesNotVendS3FileIOCredentials to also
  assert s3.path-style-access = "true", so a partial regression where the
  endpoint is vended but path-style is dropped still fails.
- deriveS3AdvertisedEndpoint now logs a startup hint when it infers the
  host from os.Hostname because the bind IP is a wildcard, pointing
  operators at -s3.externalUrl / S3_EXTERNAL_URL for reverse-proxy
  deployments where the inferred name is not externally reachable.
- handleConfig has a comment explaining that any sub-path in the
  warehouse URL is dropped because catalog routing is bucket-scoped.

* fix(iceberg): make advertised S3 endpoint strictly opt-in; add region

The wildcard-bind fallback to os.Hostname() in deriveS3AdvertisedEndpoint
was hijacking correctly-configured clients: on the CI runner it produced
http://runnervmrc6n4:<port>, which Spark (running in Docker) could not
resolve, so Spark iceberg tests began failing after the endpoint started
being vended in LoadTable responses.

Change the rule so advertising is opt-in and never guesses a host that
might not be routable:
  - -s3.externalUrl / S3_EXTERNAL_URL wins (covers reverse-proxy).
  - Otherwise, only an explicit, non-wildcard -s3.bindIp is used.
  - Wildcard / empty bind returns "" so no FileIO endpoint is vended and
    existing clients keep using their own configuration.

buildFileIOConfig additionally vends s3.region (defaulting to the same
value baked into table bucket ARNs) whenever it vends an endpoint, so
DuckDB's attach does not fail with "No region was provided via the
vended credentials" when the operator has opted in.

The DuckDB issue-9103 integration test runs under an env with a
wildcard bind, so it explicitly sets AWS_REGION in the docker run to
pick up the same default. The HTTP-level LoadTable-vending test was
dropped because its expectation is now conditional and already covered
by unit tests in iceberg_issue_9103_test.go.
2026-04-16 15:51:43 -07:00
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
d0a09ea178 fix(s3): honor ChecksumAlgorithm on presigned URL uploads (#9076)
* fix(s3): honor ChecksumAlgorithm on presigned URL uploads

AWS SDK presigners hoist x-amz-sdk-checksum-algorithm (and related
checksum headers) into the signed URL's query string, so servers must
read either location. detectRequestedChecksumAlgorithm only looked at
request headers, so presigned PUTs with ChecksumAlgorithm set validated
and stored no additional checksum, and HEAD/GET never returned the
x-amz-checksum-* header.

Read these parameters from headers first, then fall back to a
case-insensitive query-string lookup. Apply the same fallback when
comparing an object-level checksum value against the computed one.

Fixes #9075

* test(s3): presigned URL checksum integration tests (#9075)

Adds test/s3/checksum with end-to-end coverage for flexible-checksum
behavior on presigned URL uploads. Tests generate a presigned PUT URL
with ChecksumAlgorithm set, upload the body with a plain http.Client
(bypassing AWS SDK middleware so the server must honor the query-string
hoisted x-amz-sdk-checksum-algorithm), then HEAD/GET with
ChecksumMode=ENABLED and assert the stored x-amz-checksum-* header.

Covers SHA256, SHA1, and a negative control with no checksum requested.
Wires the new directory into s3-go-tests.yml as its own CI job.

* perf(s3): parse presigned query once in detectRequestedChecksumAlgorithm

Previously, each header fallback called getHeaderOrQuery, which re-parsed
r.URL.Query() and allocated a new map on every invocation — up to eight
times per PutObject request. Parse the raw query at most once per request
(only when non-empty) and pass the pre-parsed url.Values into a new
lookupHeaderOrQuery helper.

Also drops a redundant strings.ToLower allocation in the case-insensitive
query key scan (strings.EqualFold already handles ASCII case folding).

Addresses review feedback from gemini-code-assist on PR #9076.

* test(s3): honor credential env vars and add presigned upload timeout

- init() now reads S3_ACCESS_KEY/S3_SECRET_KEY (and AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY / AWS_REGION fallbacks) so that
  `make test-with-server ACCESS_KEY=... SECRET_KEY=...` no longer
  authenticates with hardcoded defaults while the server has been
  started with different credentials.
- uploadViaPresignedURL uses a dedicated http.Client with a 30s timeout
  instead of http.DefaultClient, so a stalled server fails fast in CI
  instead of blocking until the suite's global timeout fires.

Addresses review feedback from coderabbitai on PR #9076.

* test(s3): pass S3_PORT and credentials through to checksum tests

- 'make test' now exports S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY
  derived from the Makefile variables so the Go test process talks to
  the same endpoint/credentials that start-server was launched with.
- start-server cleans up the background SeaweedFS process and PID file
  when the readiness poll times out, preventing stale port conflicts on
  subsequent runs.

Addresses review feedback from coderabbitai on PR #9076.

* ci(s3): raise checksum tests step timeout

make test-with-server builds weed_binary, waits up to 90s for readiness,
then runs go test -timeout=10m. The previous 12-minute step timeout only
had ~2 minutes of headroom over the Go timeout, risking the Actions
runner killing the step before tests reported a real failure.

Bumps the job timeout from 15 to 20 minutes and the step timeout from
12 to 16 minutes, matching other S3 integration jobs.

Addresses review feedback from coderabbitai on PR #9076.

* perf(s3): thread pre-parsed query through putToFiler hot path

Parse the request's query string once at the top of putToFiler and
reuse the resulting url.Values for both the checksum-algorithm detection
and the expected-checksum verification. Previously, the verification
path called getHeaderOrQuery which re-parsed r.URL.Query() again on
every PutObject, defeating the previous commit's single-parse goal.

- Add parseRequestQuery + detectRequestedChecksumAlgorithmQ (the
  pre-parsed-query variant). detectRequestedChecksumAlgorithm is now a
  thin wrapper used by callers that do a single lookup.
- putToFiler parses once and threads the result through both call sites.
- Remove getHeaderOrQuery and update the unit test to use
  lookupHeaderOrQuery directly.

Addresses follow-up review from gemini-code-assist on PR #9076.

* test(s3): check io.ReadAll error in uploadViaPresignedURL helper

* test(s3): drop SHA1 presigned test case

The AWS SDK v2 presigner signs a Content-MD5 header at presign time
for SHA1 PutObject requests even when no body is attached (the MD5 of
the empty payload gets baked into the signed headers). Uploading the
real body via a plain http.Client then trips SeaweedFS's MD5 validation
and returns BadDigest — an SDK/presigner quirk, not a SeaweedFS bug.

The SHA256 positive case already exercises the server-side
query-hoisted algorithm path that issue #9075 is about, and the unit
tests in weed/s3api cover each algorithm's header mapping. Drop the
SHA1 integration case rather than chase SDK-specific workarounds.

* test(s3): provide real Content-MD5 to presigned checksum test

AWS SDK v2's flexible-checksum middleware signs a Content-MD5 header at
presign time. There is no body to hash at that point, so it seeds the
header with MD5 of the empty payload. When the real body is then PUT
with a plain http.Client, SeaweedFS's server-side Content-MD5
verification correctly rejects the upload with BadDigest.

Pre-compute the MD5 of the test body and thread it into
PutObjectInput.ContentMD5 so the signed Content-MD5 matches the body
that will actually be uploaded. The test still exercises the
server-side path that reads X-Amz-Sdk-Checksum-Algorithm from the
query string (the fix that PR #9076 is validating).

* test(s3): send the signed Content-MD5 header on presigned upload

uploadViaPresignedURL now accepts an extraHeaders map so callers can
thread through headers that the presigner signed but the raw http
request would otherwise omit. The SHA256 test passes the Content-MD5
it computed, matching what the presigner baked into the signature.

Fixes SignatureDoesNotMatch seen in CI after the previous commit set
ContentMD5 on the presign input without sending the corresponding
header on the actual upload.

* test(s3): build presigned URL with the raw v4 signer

The AWS SDK v2 s3.PresignClient runs the flexible-checksum middleware
for any PutObject input that carries ChecksumAlgorithm. That middleware
injects a Content-MD5 header at presign time, and with no body present
it seeds MD5-of-empty. Any subsequent upload of a non-empty body
through a plain http.Client then trips SeaweedFS's Content-MD5
verification and returns BadDigest — not the code path that issue
#9075 is about.

Replace the PresignClient usage in the integration test with a direct
call to v4.Signer.PresignHTTP, building a canonical URL whose query
string already contains x-amz-sdk-checksum-algorithm=SHA256. This is
exactly the shape of URL a browser/curl client would receive from any
presigner that hoists the algorithm header, and it exercises the
server-side fix from PR #9076 without dragging in SDK-specific
middleware quirks.

* test(s3): set X-Amz-Expires on presigned URL before signing

v4.Signer.PresignHTTP does not add X-Amz-Expires on its own — the
caller has to seed it into the request's query string so the signer
includes it in the canonical query and the server accepts the
presigned URL. Without it, SeaweedFS correctly returns
AuthorizationQueryParametersError.

Also adds a .gitignore for the make-managed test volume data, log
file, and PID file so local `make test-with-server` runs do not leave
artifacts tracked by git.

Verified by running the integration tests locally:
  make test-with-server → both presigned checksum tests PASS.
2026-04-14 21:52:49 -07:00
Chris Lu
2818251dd5 fix(iceberg): clean stale data before creating a table (#9074) (#9077)
* fix(iceberg): clean stale data before creating a table

CREATE TABLE AS from Trino fails against the SeaweedFS Iceberg REST
catalog with "Cannot create a table on a non-empty location". The
catalog assigns every new table the deterministic <ns>/<table> path,
and Trino's pre-write check rejects the CTAS whenever leftover objects
live there — typically files from a prior DROP that did not purge the
data, or an earlier aborted CTAS.

Make the catalog the authority for table existence: before writing any
metadata, look up the table in the S3 Tables catalog. If it is already
registered, return the existing definition (idempotent create). If it
is not registered, any objects still sitting at the target location
are stale, so purge them before proceeding. Live tables are never
touched — the cleanup path is guarded by the catalog lookup.

Fixes #9074

* test(trino): regression for create/drop/recreate table (#9074)

Exercises the exact sequence from the reported bug: CREATE TABLE without
an explicit location, INSERT, DROP, then CREATE again with the same name,
followed by a CTAS on top. Previously the recreate failed with
"Cannot create a table on a non-empty location" because stale data files
from the dropped table lingered at the deterministic <schema>/<table>
path. The test also asserts the recreated table does not see the dropped
data and that a drop/recreate CTAS cycle works.

* fix(iceberg): purge dropped table location on DROP TABLE

Recreate-at-same-location was failing with "Cannot create a table on a
non-empty location" because DROP TABLE only removed the catalog entry
and left data files behind. Trino's pre-write emptiness check then
rejected the subsequent CREATE.

Look up the table's storage location before deleting the catalog entry,
and after a successful DeleteTable, purge the filer subtree at that
location. The lookup uses the catalog — the authoritative owner of the
name→location mapping — so cleanup is gated on a live table having
existed; failed lookups skip cleanup and leave storage alone.

Also pin the regression test to an explicit, fixed location so the
recreate genuinely targets the same path the drop was supposed to free.

* refactor(iceberg): use errors.As in isNoSuchTableError

* fix(iceberg): drop create-preflight cleanup; harden cleanup path

- Remove the destructive cleanupStaleTableLocation call from the
  CreateTable preflight. Storage cleanup now lives exclusively on the
  DROP path, so CreateTable has no side effects on storage beyond the
  new table's own metadata write.
- Validate tablePath in cleanupStaleTableLocation by rejecting empty,
  ".", "..", or backslash-bearing segments before joining with the
  bucket prefix, so a crafted location cannot escape the bucket
  subtree. path.Clean would silently collapse traversal, so segments
  are checked raw.

* test(trino): defer table drops in recreate test for failure-safe cleanup
2026-04-14 20:34:53 -07:00
Chris Lu
40c1797f8e fix(s3): allow anonymous ListBuckets with prefix-scoped List action (#9073)
* fix(s3): allow anonymous ListBuckets with prefix-scoped List action

An anonymous identity holding a prefix-scoped action such as
"List:prefix-*" was denied at the auth middleware before ListBucketsHandler
could apply the per-bucket visibility check. The middleware called
CanDo with an empty bucket, which never matches a scoped action, so
every anonymous ListBuckets request returned 403 even though matching
buckets should have been visible.

Defer ListBuckets authorization to the handler for the anonymous
identity when it actually carries a List action, mirroring the
existing behavior for authenticated users. Anonymous identities with
no List action continue to be rejected at the global layer, preserving
the secure-by-default posture.

Fixes #9072

* refactor(s3): make hasListAction a method on Identity

Addresses PR review — consistent with existing CanDo/isAdmin methods and
also treats Admin identities as implicitly having List permission.
2026-04-14 10:52:00 -07:00
Chris Lu
eaf561e86c perf(s3): add optional shared in-memory chunk cache for GET (#9069)
Adds the -s3.cacheCapacityMB flag (default 0, disabled) that attaches
an in-memory chunk_cache.ChunkCacheInMemory to the server-wide
ReaderCache introduced in the previous commit. When enabled,
completed chunks are deposited into the shared cache as they are
downloaded, so concurrent and repeat GETs of the same object hit
memory instead of re-fetching chunks from volume servers.

When 0 (the default) the shared ReaderCache still runs — it just
attaches a nil chunk cache, so behaviour matches the previous commit
exactly. No behaviour change for clusters that don't opt in.

Disk-backed TieredChunkCache was evaluated and rejected: its
synchronous SetChunk writes regressed cold reads ~12x on loopback
because the chunk fetchers block on local disk I/O that is *slower*
than the TCP volume-server fetch it is supposed to accelerate.
Memory-only avoids that.

Flag registered in all four S3 flag sites (s3.go, server.go,
filer.go, mini.go) per the comment on command.S3Options. The chunk
size used to convert CacheSizeMB → entry count is encapsulated in
the s3ChunkCacheChunkSizeMB constant so it's easy to grep and
revisit if the filer default chunk size changes.

Measured on weed mini + 1 GiB random object over loopback, single
curl on a presigned URL:

    cacheCapacityMB=0 (off):  cold ~2900, warm ~2900 MB/s
    cacheCapacityMB=4096:     cold ~2790, warm ~5050 MB/s (+70%)
2026-04-14 09:24:35 -07:00
Chris Lu
228ed25a01 perf(s3): route GET through ChunkReadAt + per-request ReaderCache (#9068)
perf(s3): route GET through ChunkReadAt + shared ReaderCache

The S3 GET path previously used filer.PrepareStreamContentWithPrefetch,
which hands chunk bytes from the volume-server fetch goroutine to the
consumer through an io.Pipe. io.Pipe is a synchronous rendezvous, so
the prefetch=4 window only overlapped HTTP connection setup — the
actual data bytes still flowed one pipe at a time.

Switch to the same path WebDAV uses (server/webdav_server.go): build
a filer.ChunkReadAt backed by a server-wide filer.ReaderCache.
ReaderCache prefetches whole chunks into []byte buffers, so the
prefetch window translates into real in-flight bytes and the consumer
copies them out as memcpys.

The ReaderCache is server-wide (not per-request) for two reasons:

1. ChunkReadAt.Close() destroys the ReaderCache's downloader map.
   With a per-request cache, the defer on the handler would wait for
   background chunk downloads that run on context.Background() — so
   a client disconnect would block handler cleanup on downloads that
   the client no longer wants, tying up goroutines and memory.

2. Concurrent requests for the same object can share in-flight
   downloads through the shared downloader map.

No persistent ChunkCache is added in this commit — the ReaderCache is
constructed with a nil *chunk_cache.TieredChunkCache (all its methods
are nil-receiver safe). A follow-up PR wires in an in-memory chunk
cache for cross-request warm hits.

JWT for volume-server requests is generated internally by
util_http.RetriedFetchChunkData from jwtSigningReadKey, so the new
path remains compatible with JWT-protected clusters — this is the
same mechanism the WebDAV and mount read paths have been using.

Measured on weed mini + 1 GiB random object over loopback, cold
cache, single-stream curl on a presigned URL:

    before (io.Pipe):        2100-2200 MB/s
    after  (ChunkReadAt):    2900-3800 MB/s
2026-04-14 07:46:05 -07:00
Chris Lu
7aaa431bb4 s3api: prune bucket-scoped IAM actions on DeleteBucket (#9054)
* s3api: prune bucket-scoped IAM actions on DeleteBucket

DeleteBucket removed the bucket directory and collection but left
behind any identity actions configured via s3.configure that were
scoped to that bucket (e.g. Read:bucket, Write:bucket/prefix),
leaving stale auth metadata that users expected to be cleaned up
along with the bucket.

After a successful delete, strip actions whose resource is exactly
the bucket or a prefix under it, save via the credential manager,
and let the existing filer metadata subscription fan the reload out
to every S3 server. Wildcarded resources and global actions are
preserved since they may cover other buckets; static identities
are left untouched.

Fixes #5310

* s3api: address review feedback on bucket IAM prune

- Apply per-identity updates via credentialManager.UpdateUser instead
  of a full LoadConfiguration/SaveConfiguration round-trip, so the
  prune no longer clobbers concurrent IAM edits made by s3.configure
  or the IAM API during a DeleteBucket.
- Use a 30s bounded background context for the post-delete cleanup so
  it survives client disconnect — the bucket is already gone by then
  and this is best-effort bookkeeping.
- Skip static identities via IsStaticIdentity, since the credential
  store never persists them and UpdateUser would return NotFound.
2026-04-13 12:13:38 -07:00
Chris Lu
c390448906 fix(s3): preserve exact policy document in embedded IAM put/get-user-policy (#9025)
* fix(s3): preserve exact policy document in embedded IAM PutUserPolicy/GetUserPolicy (#9008)

The embedded IAM implementation (used when IAM requests go through the
S3 gateway) discarded the original policy document on PutUserPolicy,
storing only the lossy ident.Actions representation. GetUserPolicy then
reconstructed the document from these coarse-grained actions, producing
wildcard-expanded actions (s3:GetObject → s3:Get*), duplicates, and
collapsed resources (array → single string).

PR #9009 fixed this in the standalone IAM server (weed/iamapi/) but the
embedded IAM (weed/s3api/) — which is the code path most users hit —
had the same bugs.

Changes:

- Add InlinePolicyStore optional interface to credential store, with
  implementations for FilerEtcStore (uses existing PoliciesCollection),
  MemoryStore, and PropagatingCredentialStore.

- Embedded IAM PutUserPolicy now persists the original policy document
  via CredentialManager.PutUserInlinePolicy for lossless round-trips.

- Embedded IAM GetUserPolicy first tries the stored inline policy; only
  falls back to lossy reconstruction from ident.Actions when no stored
  document exists (e.g. policies created before this fix).

- Fix the fallback reconstruction: add action deduplication and preserve
  resource paths verbatim (no more spurious /* appending).

- Update DeleteUserPolicy/ListUserPolicies to use stored inline policies.

* fix(s3): address PR review feedback for embedded IAM inline policies

- Validate PolicyName is non-empty in PutUserPolicy and DeleteUserPolicy
- Add recomputeActions() to aggregate ident.Actions from ALL stored
  inline policies on put/delete, fixing the issue where a second
  PutUserPolicy would overwrite the first policy's enforcement
- Log errors from GetUserInlinePolicy in the GetUserPolicy fallback
  instead of silently ignoring them
- Add initialization guards to MemoryStore GetUserInlinePolicy and
  ListUserInlinePolicies for consistency with other read methods

* fix(s3): make inline policy persistence fatal and propagate recompute errors

Address second round of review feedback:

- recomputeActions() now returns ([]string, error) so callers can
  distinguish store failures from "no stored policies" and abort the
  mutation on transient errors instead of silently falling back.

- PutUserInlinePolicy and DeleteUserInlinePolicy failures are now fatal:
  the API call returns ServiceFailure instead of logging and continuing,
  keeping ident.Actions and stored policy state in sync.

* chore: gofmt weed/s3api/iceberg/handlers_oauth.go

Pre-existing formatting issue from #9017; fixes S3 Tables Format Check CI.
2026-04-10 18:09:22 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
2b8c16160f feat(iceberg): add OAuth2 token endpoint for DuckDB compatibility (#9017)
* feat(iceberg): add OAuth2 token endpoint for DuckDB compatibility (#9015)

DuckDB's Iceberg connector uses OAuth2 client_credentials flow,
hitting POST /v1/oauth/tokens which was not implemented, returning 404.

Add the OAuth2 token endpoint that accepts S3 access key / secret key
as client_id / client_secret, validates them against IAM, and returns
a signed JWT bearer token. The Auth middleware now accepts Bearer tokens
in addition to S3 signature auth.

* fix(test): use weed shell for table bucket creation with IAM enabled

The S3 Tables REST API requires SigV4 auth when IAM is configured.
Use weed shell (which bypasses S3 auth) to create table buckets,
matching the pattern used by the Trino integration tests.

* address review feedback: access key in JWT, full identity in Bearer auth

- Include AccessKey in JWT claims so token verification uses the exact
  credential that signed the token (no ambiguity with multi-key identities)
- Return full Identity object from Bearer auth so downstream IAM/policy
  code sees an authenticated request, not anonymous
- Replace GetSecretKeyForIdentity with GetCredentialByAccessKey for
  unambiguous credential lookup
- DuckDB test now tries the full SQL script first (CREATE SECRET +
  catalog access), falling back to simple CREATE SECRET if needed
- Tighten bearer auth test assertion to only accept 200/500

Addresses review comments from coderabbitai and gemini-code-assist.

* security: use PostFormValue, bind signing key to access key, fix port conflict

- Use r.PostFormValue instead of r.FormValue to prevent credentials from
  leaking via query string into logs and caches
- Reject client_secret in URL query parameters explicitly
- Include access key in HMAC signing key derivation to prevent
  cross-credential token forgery when secrets happen to match
- Allocate dedicated webdav port in OAuth test env to avoid port
  collision with the shared TestMain cluster
2026-04-10 11:18:11 -07:00
Chris Lu
ba90ae5c94 fix(s3): don't count ErrNotFound as filer health failure in failover (#8995)
* fix(s3): don't count ErrNotFound as filer health failure in failover

The S3 gateway's filer client failover was recording ErrNotFound
(entry doesn't exist) as a filer health failure. In multi-filer
setups where filers have separate metadata stores, normal object
lookups that return "not found" accumulated in the circuit breaker,
eventually marking healthy filers as unhealthy after just 3 lookups.

This caused the distributed lock integration test to fail with 500
InternalError: once a filer was circuit-broken, subsequent lookups
could no longer fall back, turning a would-be 412 PreconditionFailed
into an unrecoverable internal error.

Only record actual transport/server failures in the health tracker.
The failover still tries other filers for data locality, but no
longer penalizes filers for correctly reporting missing entries.

* style: inline isNotFound variable for consistency

The variable was only used once; inlining it matches the pattern
already used in the failover loop a few lines below.
2026-04-08 17:08:57 -07:00
Chris Lu
e21d7602c3 feat(iam): implement group inline policy actions (#8992)
* feat(iam): implement group inline policy actions

Add PutGroupPolicy, GetGroupPolicy, DeleteGroupPolicy, and
ListGroupPolicies to both embedded and standalone IAM servers.

The standalone IAM stores group inline policies in a new
GroupInlinePolicies field in the Policies JSON, mirroring the
existing user inline policy pattern. DeleteGroup now also checks
for inline policies before allowing deletion.

* fix: address review feedback for group inline policies

- Embedded IAM: return NotImplemented for group inline policies
  instead of silently succeeding as no-ops (Gemini + CodeRabbit)
- Standalone IAM: recompute member actions after PutGroupPolicy
  and DeleteGroupPolicy (Gemini)
- Add parameter validation for GroupName/PolicyName/PolicyDocument
  on PutGroupPolicy, DeleteGroupPolicy, ListGroupPolicies (Gemini)
- Add UserName validation for ListUserPolicies in standalone IAM
- Call cleanupGroupInlinePolicies from DeleteGroup (Gemini)
- Migrate GroupInlinePolicies on group rename in UpdateGroup (CodeRabbit)
- Fix integration test cleanup order (CodeRabbit)

* fix: persist recomputed actions and improve error handling

- Set changed=true for PutGroupPolicy/DeleteGroupPolicy in standalone
  IAM DoActions so recomputed member actions are persisted (Gemini critical)
- Make cleanupGroupInlinePolicies accept policies parameter to avoid
  redundant I/O, return error (Gemini)
- Make migrateGroupInlinePolicies return error, handle in caller (Gemini)

* fix: include group policies in action recomputation

Extend computeAllActionsForUser to also aggregate group inline
policies and group managed policies when s3cfg is provided.
Previously, group inline policies were stored but never reflected
in member Identity.Actions. (CodeRabbit critical)

* perf: use identity index in recomputeActionsForGroupMembers for O(N+M)

* fix: skip group inline policy integration test on embedded IAM

The embedded IAM returns NotImplemented for group inline policies.
Skip TestIAMGroupInlinePolicy when running against embedded mode
to avoid CI failures in the group integration test matrix.
2026-04-08 15:57:04 -07:00
Chris Lu
45ee2ab4b9 feat(iam): implement ListUserPolicies API action (#8991)
* feat(iam): implement ListUserPolicies API action (#8987)

Add ListUserPolicies support to both embedded and standalone IAM servers,
resolving the NotImplemented error when calling `aws iam list-user-policies`.

* fix: address review feedback for ListUserPolicies

- Add handleImplicitUsername for ListUserPolicies in both IAM servers
  so omitting UserName defaults to the calling user (Gemini review)
- Assert synthetic policy name in unit test (CodeRabbit)
- Use require.True for error type assertion in integration test (CodeRabbit)
2026-04-08 12:27:03 -07:00
Chris Lu
efc7f3936f fix(s3): handle empty URL path in forwarded prefix signature verification (#8973)
fix(s3): handle empty URL path in forwarded prefix signature verification (#8966)

When S3 is behind a reverse proxy with a forwarded prefix (e.g. /s3),
requests with an empty URL path (like ListBuckets) would incorrectly
get a trailing slash appended (e.g. /s3/), causing signature
verification to fail because the client signs /s3 without the slash.
2026-04-07 13:22:21 -07:00
Chris Lu
79a48256f5 fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8971)
* fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8969)

ListObjectsV2/V1 requests with prefix-restricted STS session policies
were denied because:
1. s3:prefix was derived from objectKey, which the auth middleware set to
   the prefix value, but the resource ARN then included the prefix
   (e.g. arn:aws:s3:::bucket/prefix) instead of staying at bucket level
   (arn:aws:s3:::bucket) as AWS requires for ListBucket.
2. When objectKey was empty (no middleware propagation), s3:prefix was
   never populated from the query parameter at all.

Now AuthorizeAction extracts the prefix query parameter directly, sets it
as s3:prefix in the request context, and uses a bucket-level resource ARN
when the objectKey matches the propagated prefix.

* fix(s3): use AWS-style wildcard matching for StringLike policy conditions

filepath.Match treats * as not matching /, which breaks IAM StringLike
conditions on paths (e.g. arn:aws:s3:::bucket/* won't match nested keys).
Replace with a case-sensitive variant of AwsWildcardMatch that correctly
treats * as matching any character including /.

* refactor(s3): replace regex wildcard matching with string-based matcher

Use the existing wildcard.MatchesWildcard utility instead of compiling
and caching regexes for IAM wildcard matching. Removes the regexCache,
its mutex, and the sync import.

* refactor(s3): inline and remove AwsWildcardMatch wrapper functions

Replace all call sites with direct wildcard.MatchesWildcard calls.

* fix(s3): scope s3:prefix condition key to list operations only

The s3:prefix logic was running for all actions, so a GetObject on
"foo/bar" would wrongly populate s3:prefix. Restrict it to action "List"
and always reset resourceObjectKey to "" so the resource ARN stays at
bucket level. Also set s3:prefix to "" when no prefix is provided, so
policies with StringEquals {"s3:prefix": ""} evaluate correctly.
2026-04-07 13:21:30 -07:00
Chris Lu
761ec7da00 fix(iceberg): use dot separator for namespace paths instead of unit separator (#8960)
* fix(iceberg): use dot separator for namespace paths instead of unit separator

The Iceberg REST Catalog handler was using \x1F (unit separator) to join
multi-level namespaces when constructing S3 location and filer paths. The
S3 Tables storage layer uses "." (dot) as the namespace separator, causing
tables created via the Iceberg REST API to point to different paths than
where S3 Tables actually stores them.

Fixes #8959

* fix(iceberg): use dot separator in log messages for readable namespace output

* fix(iceberg): use path.Join for S3 location path segments

Use path.Join to construct the namespace/table path segments in fallback
S3 locations for robustness and consistency with handleCreateTable.

* test(iceberg): add multi-level namespace integration tests for Spark and Trino

Add regression tests for #8959 that create a two-level namespace (e.g.
"analytics.daily"), create a table under it, insert data, and query it
back. This exercises the dot-separated namespace path construction and
verifies that Spark/Trino can actually read the data at the S3 location
returned by the Iceberg REST API.

* fix(test): enable nested namespace in Trino Iceberg catalog config

Trino requires `iceberg.rest-catalog.nested-namespace-enabled=true` to
support multi-level namespaces. Without this, CREATE SCHEMA with a
dotted name fails with "Nested namespace is not enabled for this catalog".

* fix(test): parse Trino COUNT(*) output as integer instead of substring match

Avoids false matches from strings.Contains(output, "3") by parsing the
actual numeric result with strconv.Atoi and asserting equality.

* fix(test): use separate Trino config for nested namespace test

The nested-namespace-enabled=true setting in Trino changes how SHOW
SCHEMAS works, causing "Internal error" for all tests sharing that
catalog config. Move the flag to a dedicated config used only by
TestTrinoMultiLevelNamespace.

* fix(iceberg): support parent query parameter in ListNamespaces for nested namespaces

Add handling for the Iceberg REST spec's `parent` query parameter in
handleListNamespaces. When Trino has nested-namespace-enabled=true, it
sends `GET /v1/namespaces?parent=<ns>` to list child namespaces. The
parent value is decoded from the Iceberg unit separator format and
converted to a dot-separated prefix for the S3 Tables layer.

Also simplify TestTrinoMultiLevelNamespace to focus on namespace
operations (create, list, show tables) rather than data operations,
since Trino's REST catalog has a non-empty location check that conflicts
with server-side metadata creation.

* fix(test): expand Trino multi-level namespace test and merge config helpers

- Expand TestTrinoMultiLevelNamespace to create a table with explicit
  location, insert rows, query them back, and verify the S3 file path
  contains the dot-separated namespace (not \x1F). This ensures the
  original #8959 bug would be caught by the Trino integration test.
- Merge writeTrinoConfig and writeTrinoNestedNamespaceConfig into a
  single parameterized function using functional options.
2026-04-07 12:21:22 -07:00
Chris Lu
733517df30 fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads (#8968)
* fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads

The PolicyEngine.evaluateStatement() method used raw regex matching for
actions, bypassing the multipart-inherits-PutObject logic that only
existed in the unused CompiledStatement.MatchesAction() code path.

When a bucket policy granted only s3:PutObject, multipart upload
operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload,
etc.) were denied, forcing users to explicitly list every multipart
action.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8751

* fix(s3): add s3:UploadPartCopy to multipartActionSet and improve test coverage

Add missing S3_ACTION_UPLOAD_PART_COPY constant and include it in
multipartActionSet so UploadPartCopy is implicitly allowed by s3:PutObject.

Also add a bucket-ARN sub-test for ListBucketMultipartUploads to verify
that an object-only resource pattern does not match bucket-level requests.
2026-04-07 11:13:29 -07:00
Chris Lu
d1823d3784 fix(s3): include static identities in listing operations (#8903)
* fix(s3): include static identities in listing operations

Static identities loaded from -s3.config file were only stored in the
S3 API server's in-memory state. Listing operations (s3.configure shell
command, aws iam list-users) queried the credential manager which only
returned dynamic identities from the backend store.

Register static identities with the credential manager after loading
so they are included in LoadConfiguration and ListUsers results, and
filtered out before SaveConfiguration to avoid persisting them to the
dynamic store.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896

* fix: avoid mutating caller's config and defensive copies

- SaveConfiguration: use shallow struct copy instead of mutating the
  caller's config.Identities field
- SetStaticIdentities: skip nil entries to avoid panics
- GetStaticIdentities: defensively copy PolicyNames slice to avoid
  aliasing the original

* fix: filter nil static identities and sync on config reload

- SetStaticIdentities: filter nil entries from the stored slice (not
  just from staticNames) to prevent panics in LoadConfiguration/ListUsers
- Extract updateCredentialManagerStaticIdentities helper and call it
  from both startup and the grace.OnReload handler so the credential
  manager's static snapshot stays current after config file reloads

* fix: add mutex for static identity fields and fix ListUsers for store callers

- Add sync.RWMutex to protect staticIdentities/staticNames against
  concurrent reads during config reload
- Revert CredentialManager.ListUsers to return only store users, since
  internal callers (e.g. DeletePolicy) look up each user in the store
  and fail on non-existent static entries
- Merge static usernames in the filer gRPC ListUsers handler instead,
  via the new GetStaticUsernames method
- Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was
  failing because DeletePolicy iterated static users that don't exist
  in the store

* fix: show static identities in admin UI and weed shell

The admin UI and weed shell s3.configure command query the filer's
credential manager via gRPC, which is a separate instance from the S3
server's credential manager. Static identities were only registered
on the S3 server's credential manager, so they never appeared in the
filer's responses.

- Add CredentialManager.LoadS3ConfigFile to parse a static S3 config
  file and register its identities
- Add FilerOptions.s3ConfigFile so the filer can load the same static
  config that the S3 server uses
- Wire s3ConfigFile through in weed mini and weed server modes
- Merge static usernames in filer gRPC ListUsers handler
- Add CredentialManager.GetStaticUsernames helper
- Add sync.RWMutex to protect concurrent access to static identity
  fields
- Avoid importing weed/filer from weed/credential (which pulled in
  filer store init() registrations and broke test isolation)
- Add docker/compose/s3_static_users_example.json

* fix(admin): make static users read-only in admin UI

Static users loaded from the -s3.config file should not be editable
or deletable through the admin UI since they are managed via the
config file.

- Add IsStatic field to ObjectStoreUser, set from credential manager
- Hide edit, delete, and access key buttons for static users in the
  users table template
- Show a "static" badge next to static user names
- Return 403 Forbidden from UpdateUser and DeleteUser API handlers
  when the target user is a static identity

* fix(admin): show details for static users

GetObjectStoreUserDetails called credentialManager.GetUser which only
queries the dynamic store. For static users this returned
ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup
fails.

* fix(admin): load static S3 identities in admin server

The admin server has its own credential manager (gRPC store) which is
a separate instance from the S3 server's and filer's. It had no static
identity data, so IsStaticIdentity returned false (edit/delete buttons
shown) and GetStaticIdentity returned nil (details page failed).

Pass the -s3.config file path through to the admin server and call
LoadS3ConfigFile on its credential manager, matching the approach
used for the filer.

* fix: use protobuf is_static field instead of passing config file path

The previous approach passed -s3.config file path to every component
(filer, admin). This is wrong because the admin server should not need
to know about S3 config files.

Instead, add an is_static field to the Identity protobuf message.
The field is set when static identities are serialized (in
GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads
configuration via GetConfiguration automatically sees which identities
are static, without needing the config file.

- Add is_static field (tag 8) to iam_pb.Identity proto message
- Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile
- Admin GetObjectStoreUsers reads identity.IsStatic from proto
- Admin IsStaticUser helper loads config via gRPC to check the flag
- Filer GetUser gRPC handler falls back to GetStaticIdentity
- Remove s3ConfigFile from AdminOptions and NewAdminServer signature
2026-04-03 20:01:28 -07:00
Chris Lu
0798b274dd feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads

Add a pipe-based prefetch pipeline that overlaps chunk fetching with
response writing during S3 GetObject, SSE downloads, and filer proxy.

While chunk N streams to the HTTP response, fetch goroutines for the
next K chunks establish HTTP connections to volume servers ahead of
time, eliminating the RTT gap between sequential chunk fetches.

Uses io.Pipe for minimal memory overhead (~1MB per download regardless
of chunk size, vs buffering entire chunks). Also increases the
streaming read buffer from 64KB to 256KB to reduce syscall overhead.

Benchmark results (64KB chunks, prefetch=4):
- 0ms latency:  1058 → 2362 MB/s (2.2× faster)
- 5ms latency:  11.0 → 41.7 MB/s (3.8× faster)
- 10ms latency: 5.9  → 23.3 MB/s (4.0× faster)
- 20ms latency: 3.1  → 12.1 MB/s (3.9× faster)

* fix: address review feedback for prefetch pipeline

- Fix data race: use *chunkPipeResult (pointer) on channel to avoid
  copying struct while fetch goroutines write to it. Confirmed clean
  with -race detector.
- Remove concurrent map write: retryWithCacheInvalidation no longer
  updates fileId2Url map. Producer only reads it; consumer never writes.
- Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure.
- Add local cancellable context so consumer errors (client disconnect)
  immediately stop the producer and all in-flight fetch goroutines.

* fix(test): remove dead code and add Range header support in test server

- Remove unused allData variable in makeChunksAndServer
- Add Range header handling to createTestServer for partial chunk
  read coverage (206 Partial Content, 416 Range Not Satisfiable)

* fix: correct retry condition and goroutine leak in prefetch pipeline

- Fix retry condition: use result.fetchErr/result.written instead of
  copied to decide cache-invalidation retry. The old condition wrongly
  triggered retry when the fetch succeeded but the response writer
  failed on the first write (copied==0 despite fetcher having data).
  Now matches the sequential path (stream.go:197) which checks whether
  the fetcher itself wrote zero bytes.

- Fix goroutine leak: when the producer's send to the results channel
  is interrupted by context cancellation, the fetch goroutine was
  already launched but the result was never sent to the channel. The
  drain loop couldn't handle it. Now waits on result.done before
  returning so every fetch goroutine is properly awaited.
2026-04-03 19:57:30 -07:00
Chris Lu
3efe88c718 feat(s3): store and return checksum headers for additional checksum algorithms (#8914)
* feat(s3): store and return checksum headers for additional checksum algorithms

When clients upload with --checksum-algorithm (SHA256, CRC32, etc.),
SeaweedFS validated the checksum but discarded it. The checksum was
never stored in metadata or returned in PUT/HEAD/GET responses.

Now the checksum is computed alongside MD5 during upload, stored in
entry extended attributes, and returned as the appropriate
x-amz-checksum-* header in all responses.

Fixes #8911

* fix(s3): address review feedback and CI failures for checksum support

- Gate GET/HEAD checksum response headers on x-amz-checksum-mode: ENABLED
  per AWS S3 spec, fixing FlexibleChecksumError on ranged GETs and
  multipart copies
- Verify computed checksum against client-provided header value for
  non-chunked uploads, returning BadDigest on mismatch
- Add nil check for getCheckSumWriter to prevent panic
- Handle comma-separated values in X-Amz-Trailer header
- Use ordered slice instead of map for deterministic checksum header
  selection; extract shared mappings into package-level vars

* fix(s3): skip checksum header for ranged GET responses

The stored checksum covers the full object. Returning it for ranged
(partial) responses causes SDK checksum validation failures because the
SDK validates the header value against the partial content received.

Skip emitting x-amz-checksum-* headers when a Range request header is
present, fixing PyArrow large file read failures.

* fix(s3): reject unsupported checksum algorithm with 400

detectRequestedChecksumAlgorithm now returns an error code when
x-amz-sdk-checksum-algorithm or x-amz-checksum-algorithm contains
an unsupported value, instead of silently ignoring it.

* feat(s3): compute composite checksum for multipart uploads

Store the checksum algorithm during CreateMultipartUpload, then during
CompleteMultipartUpload compute a composite checksum from per-part
checksums following the AWS S3 spec: concatenate raw per-part checksums,
hash with the same algorithm, format as "base64-N" where N is part count.

The composite checksum is persisted on the final object entry and
returned in HEAD/GET responses (gated on x-amz-checksum-mode: ENABLED).

Reuses existing per-part checksum storage from putToFiler and the
getCheckSumWriter/checksumHeaders infrastructure.

* fix(s3): validate checksum algorithm in CreateMultipartUpload, error on missing part checksums

- Move detectRequestedChecksumAlgorithm call before mkdir callback so
  an unsupported algorithm returns 400 before the upload is created
- Change computeCompositeChecksum to return an error when a part is
  missing its checksum (the upload was initiated with a checksum
  algorithm, so all parts must have checksums)
- Propagate the error as ErrInvalidPart in CompleteMultipartUpload

* fix(s3): return checksum header in CompleteMultipartUpload response, validate per-part algorithm

- Add ChecksumHeaderName/ChecksumValue fields to CompleteMultipartUploadResult
  and set the x-amz-checksum-* HTTP response header in the handler, matching
  the AWS S3 CompleteMultipartUpload response spec
- Validate that each part's stored checksum algorithm matches the upload's
  expected algorithm before assembling the composite checksum; return an
  error if a part was uploaded with a different algorithm
2026-04-03 18:37:54 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
Chris Lu
8fad85aed7 feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK (#8904)
* feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK

Add support for providing the SSE-S3 Key Encryption Key (KEK) via the
WEED_S3_SSE_KEY environment variable (hex-encoded 256-bit key). This
avoids storing the master key in plaintext on the filer at /etc/s3/sse_kek.

Key source priority:
1. WEED_S3_SSE_KEY environment variable (recommended)
2. Existing filer KEK at /etc/s3/sse_kek (backward compatible)
3. Auto-generate and save to filer (deprecated for new deployments)

Existing deployments with a filer-stored KEK continue to work unchanged.
A deprecation warning is logged when auto-generating a new filer KEK.

* refactor(s3): derive KEK from any string via HKDF instead of requiring hex

Accept any secret string in WEED_S3_SSE_KEY and derive a 256-bit key
using HKDF-SHA256 instead of requiring a hex-encoded key. This is
simpler for users — no need to generate hex, just set a passphrase.

* feat(s3): add WEED_S3_SSE_KEK and WEED_S3_SSE_KEY env vars for KEK

Two env vars for providing the SSE-S3 Key Encryption Key:

- WEED_S3_SSE_KEK: hex-encoded, same format as /etc/s3/sse_kek.
  If the filer file also exists, they must match.
- WEED_S3_SSE_KEY: any string, 256-bit key derived via HKDF-SHA256.
  Refuses to start if /etc/s3/sse_kek exists (must delete first).

Only one may be set. Existing filer-stored KEKs continue to work.
Auto-generating and storing new KEKs on filer is deprecated.

* fix(s3): stop auto-generating KEK, fail only when SSE-S3 is used

Instead of auto-generating a KEK and storing it on the filer when no
key source is configured, simply leave SSE-S3 disabled. Encrypt and
decrypt operations return a clear error directing the user to set
WEED_S3_SSE_KEK or WEED_S3_SSE_KEY.

* refactor(s3): move SSE-S3 KEK config to security.toml

Move KEK configuration from standalone env vars to security.toml's new
[sse_s3] section, following the same pattern as JWT keys and TLS certs.

  [sse_s3]
  kek = ""   # hex-encoded 256-bit key (same format as /etc/s3/sse_kek)
  key = ""   # any string, HKDF-derived

Viper's WEED_ prefix auto-mapping provides env var support:
WEED_SSE_S3_KEK and WEED_SSE_S3_KEY.

All existing behavior is preserved: filer KEK fallback, mismatch
detection, and HKDF derivation.

* refactor(s3): rename SSE-S3 config keys to s3.sse.kek / s3.sse.key

Use [s3.sse] section in security.toml, matching the existing naming
convention (e.g. [s3.*]). Env vars: WEED_S3_SSE_KEK, WEED_S3_SSE_KEY.

* fix(s3): address code review findings for SSE-S3 KEK

- Don't hold mutex during filer retry loop (up to 20s of sleep).
  Lock only to write filerClient and superKey.
- Remove dead generateAndSaveSuperKeyToFiler and unused constants.
- Return error from deriveKeyFromSecret instead of ignoring it.
- Fix outdated doc comment on InitializeWithFiler.
- Use t.Setenv in tests instead of manual os.Setenv/Unsetenv.

* fix(s3): don't block startup on filer errors when KEK is configured

- When s3.sse.kek is set, a temporarily unreachable filer no longer
  prevents startup. The filer consistency check becomes best-effort
  with a warning.
- Same treatment for s3.sse.key: filer unreachable logs a warning
  instead of failing.
- Rewrite error messages to suggest migration instead of file deletion,
  avoiding the risk of orphaning encrypted data.

Finding 3 (restore auto-generation) intentionally skipped — auto-gen
was removed by design to avoid storing plaintext KEK on filer.

* fix(test): set WEED_S3_SSE_KEY in SSE integration test server startup

SSE-S3 no longer auto-generates a KEK, so integration tests must
provide one. Set WEED_S3_SSE_KEY=test-sse-s3-key in all weed mini
invocations in the test Makefile.
2026-04-03 13:01:21 -07:00
Chris Lu
059bee683f feat(s3): add STS GetFederationToken support (#8891)
* feat(s3): add STS GetFederationToken support

Implement the AWS STS GetFederationToken API, which allows long-term IAM
users to obtain temporary credentials scoped down by an optional inline
session policy. This is useful for server-side applications that mint
per-user temporary credentials.

Key behaviors:
- Requires SigV4 authentication from a long-term IAM user
- Rejects calls from temporary credentials (session tokens)
- Name parameter (2-64 chars) identifies the federated user
- DurationSeconds supports 900-129600 (15 min to 36 hours, default 12h)
- Optional inline session policy for permission scoping
- Caller's attached policies are embedded in the JWT token
- Returns federated user ARN: arn:aws:sts::<account>:federated-user/<Name>

No performance impact on the S3 hot path — credential vending is a
separate control-plane operation, and all policy data is embedded in
the stateless JWT token.

* fix(s3): address GetFederationToken PR review feedback

- Fix Name validation: max 32 chars (not 64) per AWS spec, add regex
  validation for [\w+=,.@-]+ character whitelist
- Refactor parseDurationSeconds into parseDurationSecondsWithBounds to
  eliminate duplicated duration parsing logic
- Add sts:GetFederationToken permission check via VerifyActionPermission
  mirroring the AssumeRole authorization pattern
- Change GetPoliciesForUser to return ([]string, error) so callers fail
  closed on policy-resolution failures instead of silently returning nil
- Move temporary-credentials rejection before SigV4 verification for
  early rejection and proper test coverage
- Update tests: verify specific error message for temp cred rejection,
  add regex validation test cases (spaces, slashes rejected)

* refactor(s3): use sts.Action* constants instead of hard-coded strings

Replace hard-coded "sts:AssumeRole" and "sts:GetFederationToken" strings
in VerifyActionPermission calls with sts.ActionAssumeRole and
sts.ActionGetFederationToken package constants.

* fix(s3): pass through sts: prefix in action resolver and merge policies

Two fixes:

1. mapBaseActionToS3Format now passes through "sts:" prefix alongside
   "s3:" and "iam:", preventing sts:GetFederationToken from being
   rewritten to s3:sts:GetFederationToken in VerifyActionPermission.
   This also fixes the existing sts:AssumeRole permission checks.

2. GetFederationToken policy embedding now merges identity.PolicyNames
   (from SigV4 identity) with policies from the IAM manager (which may
   include group-attached policies), deduplicated via a map. Previously
   the IAM manager lookup was skipped when identity.PolicyNames was
   non-empty, causing group policies to be omitted from the token.

* test(s3): add integration tests for sts: action passthrough and policy merge

Action resolver tests:
- TestMapBaseActionToS3Format_ServicePrefixPassthrough: verifies s3:, iam:,
  and sts: prefixed actions pass through unchanged while coarse actions
  (Read, Write) are mapped to S3 format
- TestResolveS3Action_STSActionsPassthrough: verifies sts:AssumeRole,
  sts:GetFederationToken, sts:GetCallerIdentity pass through ResolveS3Action
  unchanged with both nil and real HTTP requests

Policy merge tests:
- TestGetFederationToken_GetPoliciesForUser: tests IAMManager.GetPoliciesForUser
  with no user store (error), missing user, user with policies, user without
- TestGetFederationToken_PolicyMergeAndDedup: tests that identity.PolicyNames
  and IAM-manager-resolved policies are merged and deduplicated (SharedPolicy
  appears in both sources, result has 3 unique policies)
- TestGetFederationToken_PolicyMergeNoManager: tests that when IAM manager is
  unavailable, identity.PolicyNames alone are embedded

* test(s3): add end-to-end integration tests for GetFederationToken

Add integration tests that call GetFederationToken using real AWS SigV4
signed HTTP requests against a running SeaweedFS instance, following the
existing pattern in test/s3/iam/s3_sts_assume_role_test.go.

Tests:
- TestSTSGetFederationTokenValidation: missing name, name too short/long,
  invalid characters, duration too short/long, malformed policy, anonymous
  rejection (7 subtests)
- TestSTSGetFederationTokenRejectTemporaryCredentials: obtains temp creds
  via AssumeRole then verifies GetFederationToken rejects them
- TestSTSGetFederationTokenSuccess: basic success, custom 1h duration,
  36h max duration with expiration time verification
- TestSTSGetFederationTokenWithSessionPolicy: creates a bucket, obtains
  federated creds with GetObject-only session policy, verifies GetObject
  succeeds and PutObject is denied using the AWS SDK S3 client
2026-04-02 17:37:05 -07:00
Chris Lu
a4b896a224 fix(s3): skip directories before marker in ListObjectVersions pagination (#8890)
* fix(s3): skip directories before marker in ListObjectVersions pagination

ListObjectVersions was re-traversing the entire directory tree from the
beginning on every paginated request, only skipping entries at the leaf
level. For buckets with millions of objects in deep hierarchies, this
caused exponentially slower responses as pagination progressed.

Two optimizations:
1. Use keyMarker to compute a startFrom position at each directory level,
   skipping directly to the relevant entry instead of scanning from the
   beginning (mirroring how ListObjects uses marker descent).
2. Skip recursing into subdirectories whose keys are entirely before the
   keyMarker.

Changes per-page cost from O(entries_before_marker) to O(tree_depth).

* test(s3): add integration test for deep-hierarchy version listing pagination

Adds TestVersioningPaginationDeepDirectoryHierarchy which creates objects
across 20 subdirectories at depth 6 (mimicking Veeam 365 backup layout)
and paginates through them with small maxKeys. Verifies correctness
(no duplicates, sorted order, all objects found) and checks that later
pages don't take dramatically longer than earlier ones — the symptom
of the pre-fix re-traversal bug. Also tests delimiter+pagination
interaction across subdirectories.

* test(s3): strengthen deep-hierarchy pagination assertions

- Replace timing warning (t.Logf) with a failing assertion (t.Errorf)
  so pagination regressions actually fail the test.
- Replace generic count/uniqueness/sort checks on CommonPrefixes with
  exact equality against the expected prefix slice, catching wrong-but-
  sorted results.

* test(s3): use allKeys for exact assertion in deep-hierarchy pagination test

Wire the allKeys slice (previously unused dead code) into the version
listing assertion, replacing generic count/uniqueness/sort checks with
an exact equality comparison against the keys that were created.
2026-04-02 15:59:52 -07:00
Chris Lu
7c59b639c9 STS: add GetCallerIdentity support (#8893)
* STS: add GetCallerIdentity support

Implement the AWS STS GetCallerIdentity action, which returns the
ARN, account ID, and user ID of the caller based on SigV4 authentication.
This is commonly used by AWS SDKs and CLI tools (e.g. `aws sts get-caller-identity`)
to verify credentials and determine the authenticated identity.

* test: remove trivial GetCallerIdentity tests

Remove the XML unmarshal test (we don't consume this response as input)
and the routing constant test (just asserts a literal equals itself).

* fix: route GetCallerIdentity through STS in UnifiedPostHandler and use stable UserId

- UnifiedPostHandler only dispatched actions starting with "AssumeRole" to STS,
  so GetCallerIdentity in a POST body would fall through to the IAM path and
  get AccessDenied for non-admin users. Add explicit check for GetCallerIdentity.
- Use identity.Name as UserId instead of credential.AccessKey, which is a
  transient value and incorrect for STS assumed-role callers.
2026-04-02 15:59:09 -07:00
Chris Lu
ab4e52ae2f fix(s3): use recursive delete for .versions directory cleanup (#8887)
* fix(s3): use recursive delete for .versions directory cleanup

When only delete markers remain in a .versions directory,
updateLatestVersionAfterDeletion tried to delete it non-recursively,
which failed with "fail to delete non-empty folder" because the delete
marker entries were still present. Use recursive deletion so the
directory and its remaining delete marker entries are cleaned up together.

* fix(s3): guard .versions directory deletion against truncated listings

When the version listing is truncated (>1000 entries), content versions
may exist beyond the first page. Skip the recursive directory deletion
in this case to prevent data loss.

* fix(s3): preserve delete markers in .versions directory

Delete markers must be preserved per S3 semantics — they are only
removed by an explicit DELETE with versionId. The previous fix would
recursively delete the entire .versions directory (including delete
markers) when no content versions were found.

Now the logic distinguishes three cases:
1. Content versions exist → update latest version metadata
2. Only delete markers remain (or listing truncated) → keep directory
3. Truly empty → safe to delete directory (non-recursive)
2026-04-02 11:55:13 -07:00
Chris Lu
efbed39e25 S3: map canned ACL to file permissions and add configurable default file mode (#8886)
* S3: map canned ACL to file permissions and add configurable default file mode

S3 uploads were hardcoded to 0660 regardless of ACL headers. Now the
X-Amz-Acl header maps to Unix file permissions per-object:
- public-read, authenticated-read, bucket-owner-read → 0644
- public-read-write → 0666
- private, bucket-owner-full-control → 0660

Also adds -defaultFileMode / -s3.defaultFileMode flag to set a
server-wide default when no ACL header is present.

Closes #8874

* Address review feedback for S3 file mode feature

- Extract hardcoded 0660 to defaultFileMode constant
- Change parseDefaultFileMode to return error instead of calling Fatalf
- Add -s3.defaultFileMode flag to filer.go and mini.go (was missing)
- Add doc comment to S3Options about updating all four flag sites
- Add TestResolveFileMode with 10 test cases covering ACL mapping,
  server default, and priority ordering
2026-04-02 11:51:54 -07:00
Chris Lu
b3e50bb12f fix(s3): remove customer encryption key from SSE-C debug log (#8875)
* fix(s3): remove customer encryption key from SSE-C debug log

The debug log in validateAndParseSSECHeaders was logging the raw
customer-provided encryption key bytes in hex format (keyBytes=%x),
leaking sensitive key material to log output. Remove the key bytes
from the log statement while keeping the MD5 hash comparison info.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-01 23:23:56 -07:00
msementsov
4c13a9ce65 Client disconnects create context cancelled errors, 500x errors and Filer lookup failures (#8845)
* Update stream.go

Client disconnects create context cancelled errors and Filer lookup failures

* s3api: handle canceled stream requests cleanly

* s3api: address canceled streaming review feedback

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-30 12:11:30 -07:00
Lars Lehtonen
5c5d377277 weed/s3api: prune test-only functions (#8840)
weed/s3api: prune functions that are referenced only from tests and the tests that exercise them.
2026-03-30 09:43:33 -07:00
Chris Lu
e8a6fcaafb s3api: skip TTL fast-path for versioned buckets (#8823)
* s3api: skip TTL fast-path for versioned buckets (#8757)

PutBucketLifecycleConfiguration was translating Expiration.Days into
filer.conf TTL entries for all buckets. For versioned buckets this is
wrong:

 1. TTL volumes expire as a unit, destroying all data — including
    noncurrent versions that should be preserved.
 2. Filer-backend TTL (RocksDB compaction filter, Redis key expiry)
    removes entries without triggering chunk deletion, leaving orphaned
    volume data with 0 deleted bytes.
 3. On AWS S3, Expiration.Days on a versioned bucket creates a delete
    marker — it does not hard-delete data. TTL has no such nuance.

Fix: skip the TTL fast-path when the bucket has versioning enabled or
suspended. All lifecycle rules are evaluated at scan time by the
lifecycle worker instead.

Also fix the lifecycle worker to evaluate Expiration rules against the
latest version in .versions/ directories, which was previously skipped
entirely — only NoncurrentVersionExpiration was handled.

* lifecycle worker: handle SeaweedList error in versions dir cleanup

Do not assume the directory is empty when the list call fails — log
the error and skip the directory to avoid incorrect deletion.

* address review feedback

- Fetch version file for tag-based rules instead of reading tags from
  the .versions directory entry where they are not cached.
- Handle getBucketVersioningStatus error by failing closed (treat as
  versioned) to avoid creating TTL entries on transient failures.
- Capture and assert deleteExpiredObjects return values in test.
- Improve test documentation.
2026-03-29 00:05:53 -07:00
Chris Lu
297cdef1a3 s3api: accept all supported lifecycle rule types (#8813)
* s3api: accept NoncurrentVersionExpiration, AbortIncompleteMultipartUpload, Expiration.Date

Update PutBucketLifecycleConfigurationHandler to accept all newly-supported
lifecycle rule types. Only Transition and NoncurrentVersionTransition are
still rejected (require storage class tier infrastructure).

Changes:
- Remove ErrNotImplemented for Expiration.Date (handled by worker at scan time)
- Only reject rules with Transition.set or NoncurrentVersionTransition.set
- Extract prefix from Filter.And when present
- Add comment explaining that non-Expiration.Days rules are evaluated by
  the lifecycle worker from stored lifecycle XML, not via filer.conf TTL

The lifecycle XML is already stored verbatim in bucket metadata, so new
rule types are preserved on Get even without explicit handler support.
Filer.conf TTL entries are only created for Expiration.Days (fast path).

* s3api: skip TTL fast path for rules with tag or size filters

Rules with tag or size constraints (Filter.Tag, Filter.And with tags
or size bounds, Filter.ObjectSizeGreaterThan/LessThan) must not be
lowered to filer.conf TTL entries, because TTL applies unconditionally
to all objects under the prefix. These rules are evaluated at scan
time by the lifecycle worker which checks each object's tags and size.

Only simple Expiration.Days rules with prefix-only filters use the
TTL fast path (RocksDB compaction filter).

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 19:39:21 -07:00