345 Commits

Author SHA1 Message Date
Chris Lu
87bb0a4115 test(s3tables): capture weed mini stdout/stderr in catalog_spark tests
Without this, when "weed mini" fails to bind its master port within the
45s startup timeout the only diagnostic is the timeout message itself,
making flakes impossible to root-cause. The sibling catalog, catalog_dremio,
and catalog_trino packages already wire stdout/stderr through; bring
catalog_spark in line.
2026-05-03 21:39:42 -07:00
Chris Lu
6844ec067c fix(s3): cache remote-only source before CopyObject (#9304) (#9305)
* fix(s3): cache remote-only source before CopyObject (#9304)

CopyObject from a remote.mount source whose object lives only upstream
created a destination entry with FileSize > 0 but no chunks/content,
because the resolved source entry has no local chunks and the copy path
fell into the "inline/empty chunks" branch with empty entry.Content.
A subsequent GET returned 500 with "data integrity error: size N
reported but no content available". CopyObjectPart had the same shape
via copyChunksForRange iterating an empty chunk list.

Detect entry.IsInRemoteOnly() right after resolving the source in both
CopyObjectHandler and CopyObjectPartHandler and cache the object to the
local cluster first via a new cacheRemoteObjectForCopy helper (a
copy-time analogue of cacheRemoteObjectForStreaming with a bounded 30s
timeout and version-aware path resolution). If caching fails or
produces no chunks, return 503 with Retry-After: 5 instead of writing
a metadata-only destination, mirroring the GetObject behavior added in
the #7817 cold-cache fix.

Adds TestCopyObjectRemoteOnlySourceDetection pinning the four entry
shapes the fix branches on plus the pre-fix broken-output shape.

* address PR review on remote-only copy fix

- Use the resolved entry's version id when srcVersionId is empty so a
  CopyObject reading the latest object in a versioning-enabled bucket
  caches the correct .versions/v_<id> path instead of getting stuck in
  a 503 retry loop. New helper resolvedSourceVersionId handles the
  fallback for both CopyObject and CopyObjectPart.
- Drop the redundant cachedEntry.IsInRemoteOnly() recheck in both
  handlers; the cache helper now reports success based on local data
  presence, and IsInRemoteOnly does not look at inline Content so
  keeping the check would 503 on small inline-cached objects.
- Treat inline Content as a successful cache result in both
  cacheRemoteObjectForStreaming and cacheRemoteObjectForCopy via a
  shared cachedEntryHasLocalData predicate. The CopyObject inline
  branch already handles entries that have Content but no chunks.
- Extract buildVersionedRemoteObjectPath so the streaming and copy
  cache helpers share path construction.

Adds TestResolvedSourceVersionId and TestCachedEntryHasLocalData to
pin the new helpers' contracts.

* narrow streaming cache contract back to chunks-only

CodeRabbit flagged that cacheRemoteObjectForStreaming's caller in
streamFromVolumeServers (lines 997-1002) still required non-empty
chunks, so content-only cache hits would fall through to a 503 retry
loop instead of being honored.

Resolve by keeping the helper's contract chunks-only: the filer's
caching code only ever writes chunks, the streaming downstream isn't
wired to read inline Content from a cached entry, and a partial range-
aware inline writer here would be overkill for a path that doesn't
actually occur in practice. cacheRemoteObjectForCopy keeps the relaxed
contract since the copy path's inline branch genuinely handles both
chunked and content-only entries.

Document the asymmetry on cachedEntryHasLocalData and on
cacheRemoteObjectForStreaming so a future reader can see why the two
helpers diverge.

* extend version-id resolution to streaming cache path

CodeRabbit flagged that GetObjectHandler still passed the raw query
versionId to cacheRemoteObjectForStreaming. For latest-version reads
in versioning-enabled buckets that stays empty even though the
resolved entry lives at .versions/v_<id>, so remote-only GETs would
keep caching the wrong path and 503-ing forever. Reuse the new
resolvedSourceVersionId helper at the streaming call site too.

Also document on cachedEntryHasLocalData that the zero-byte case
flagged in the same review is handled upstream (IsInRemoteOnly
requires RemoteSize > 0, so the cache helper is never invoked for
empty remote objects -- CopyObject's pre-existing inline branch
writes a correct empty destination directly). Pin this with a new
test case.

* trim verbose comments

Drop tutorial-style and review-history comments. Keep only the WHY
that isn't obvious from identifiers: the #9304 reference on the new
branches in CopyObject / CopyObjectPart, the latest-version-fallback
rationale on resolvedSourceVersionId, and the streaming/copy contract
asymmetry on cachedEntryHasLocalData.

* drop issue references from comments

Issue numbers belong in PR descriptions and commit messages, not in
source comments where they rot. Replace with the underlying invariant
the code is preserving.

* test: drive remote-object cache helpers through real gRPC

Existing tests only re-enacted helper-function branching in test
space, so they could not have caught a handler that consumed a
remote-only entry without going through the cache. Stand up an
in-process filer gRPC server (UnimplementedSeaweedFilerServer +
configurable CacheRemoteObjectToLocalCluster response) and exercise
the two cache helpers end-to-end.

What's pinned:
- cacheRemoteObjectForCopy returns nil when the cache makes no
  progress (response is still remote-only), lets gRPC errors
  through as nil, accepts both chunked and inline-content cache
  hits, and surfaces deadline-exceeded as nil so callers can 503
  instead of holding the request open.
- Versioned source paths route to .versions/v_<id>; non-versioned
  and "null" stay at the bucket-relative path. Captured by reading
  the request the stub server received.
- cacheRemoteObjectForStreaming holds the stricter chunks-only
  contract: a content-only cache hit is not propagated, since
  streamFromVolumeServers' downstream isn't wired to read from
  inline Content there.

Any current or future handler that calls these helpers exercises
the same gRPC path under test, so the bug class is closed for
helper-routed cache calls.

* move remote-only copy test into the integration suite

The previous gRPC-stub test in weed/s3api/ was integration-flavored
but stubbed; relocate the coverage to the existing two-server suite
under test/s3/remote_cache/, which already exercises the real
remote.mount + remote.uncache flow against a primary SeaweedFS plus
a secondary acting as remote storage.

The new test/s3/remote_cache/remote_cache_copy_test.go drives:
- TestRemoteCacheCopyObject: upload to primary, uncache (entry now
  remote-only), CopyObject to a new key, GET the destination. Pre-
  fix the GET returned 500 'data integrity error: size N reported
  but no content'; this pins the fixed behavior over real HTTP
  through the actual handler stack.
- TestRemoteCacheCopyObjectPart: same shape via multipart
  UploadPartCopy on a 6 MiB object split into two parts, exercising
  CopyObjectPartHandler's range-copy path.

Drop weed/s3api/s3api_remote_storage_grpc_test.go: the helper-level
classification tests in s3api_remote_storage_test.go still cover
the contract pieces (cachedEntryHasLocalData, resolvedSourceVersionId,
the remote-only entry shape), and the integration suite covers the
end-to-end behavior that those classifications enable.
2026-05-03 18:52:45 -07:00
Chris Lu
fc75f16c30 test(s3tables): expand Dremio Iceberg catalog test coverage (#9303)
* test(s3tables): expand Dremio Iceberg catalog test coverage

Restructure TestDremioIcebergCatalog into subtests and add three new
checks that go beyond a connectivity smoke test:

- ColumnProjection: SELECT id, label proves Dremio parsed the schema
  served by the SeaweedFS REST catalog (the previous SELECT COUNT(*)
  passed without exercising any column metadata).
- InformationSchemaColumns: verifies the table's columns are listed in
  Dremio's INFORMATION_SCHEMA.COLUMNS in the expected ordinal order.
- InformationSchemaTables: verifies the table is registered in
  INFORMATION_SCHEMA.TABLES.

All subtests share a single Dremio container startup, so total
runtime is unchanged.

* test(s3tables): exercise multi-level Iceberg namespaces from Dremio

Seed a 2-level Iceberg namespace (and a table inside it) via the REST
catalog before bootstrapping Dremio, then add a MultiLevelNamespace
subtest that scans the nested table by its dot-separated reference.

This relies on isRecursiveAllowedNamespaces=true (already set in the
Dremio source config) to surface the nested levels as folders. A
regression in either the SeaweedFS namespace path encoding (#8959-style)
or Dremio's recursive-namespace discovery would surface here.

Adds two helpers to keep the existing single-level call sites unchanged:

- createIcebergNamespaceLevels: namespace creation with []string levels
- createIcebergTableInLevels: table creation with []string levels and
  unit-separator (0x1F) URL encoding for the namespace path component

* test(s3tables): verify Dremio reads PyIceberg-written rows

The previous Dremio subtests only scanned empty tables, so they did not
exercise the data path - just the catalog/metadata path. Add a
PyIceberg-based writer that materializes parquet files plus a snapshot
on a separate table before Dremio bootstraps, and two new subtests:

- ReadWrittenDataCount: SELECT COUNT(*) returns 3.
- ReadWrittenDataValues: SELECT id, label ORDER BY id returns the three
  written rows with the expected (id, label) pairs.

The writer runs in a small image (Dockerfile.writer) built locally on
demand. It pip-installs pyiceberg+pyarrow once and reuses the layer
cache on subsequent runs. The CI workflow pre-pulls python:3.11-slim
to keep cold runs predictable.

The writer authenticates via the OAuth2 client_credentials flow that
SeaweedFS already exposes at /v1/oauth/tokens, mirroring the Go-side
helper used for REST-API table creation.

* test(s3tables): fix Dremio writer required-field schema mismatch

PyIceberg's append() compatibility check rejects an arrow column whose
nullability does not match the Iceberg field. The table schema declares
id as `required long`, but the default pyarrow int64 column is nullable
- so the writer failed with:

    1: id: required long  vs.  1: id: optional long

Declare an explicit pyarrow schema with nullable=False on id and
nullable=True on label to match the Iceberg side.
2026-05-03 00:17:16 -07:00
Chris Lu
1f6f473995 refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
2026-05-02 18:03:13 -07:00
Chris Lu
b2f4ebb776 test(s3tables): add Dremio Iceberg catalog integration tests (#9299)
* test(s3tables): add Dremio Iceberg catalog integration tests

Add comprehensive integration tests for Dremio with SeaweedFS's Iceberg
REST Catalog, following the same patterns as existing Spark and Trino tests.

Tests include:
- Basic catalog connectivity and schema operations
- Table creation, insertion, and querying (CRUD)
- Deterministic table location specification
- Multi-level namespace support

Implementation includes:
- dremio_catalog_test.go: Core test environment and basic operations
- dremio_crud_operations_test.go: Schema and table CRUD testing
- dremio_deterministic_location_test.go: Location and namespace testing
- Comprehensive README and implementation documentation

CI/CD:
- Added dremio-iceberg-catalog-tests job to s3-tables-tests.yml
- Pre-pulls Dremio image, runs with 25m timeout
- Uploads artifacts on failure

* add docstrings to Dremio integration tests and fix CI image pre-pull

- Add function docstrings to all test functions and helper functions
  in dremio_catalog_test.go, dremio_crud_operations_test.go, and
  dremio_deterministic_location_test.go to improve code documentation
  and satisfy CodeRabbit's docstring coverage requirements.

- Make Dremio Docker image pre-pull non-critical in CI workflow.
  The pre-pull was failing with access denied error, but the image
  can still be pulled at runtime. Using continue-on-error to allow
  tests to proceed.

* fix: correct YAML syntax in Dremio CI workflow

Use multi-line run command with pipe operator (|) instead of
inline command with || operator to avoid YAML parsing errors.
The || operator was causing 'mapping values are not allowed here'
syntax errors in the YAML parser.

* make Dremio tests gracefully skip if container unavailable

Modify startDremioContainer and waitForDremio to return boolean values
instead of fataling. Tests now skip gracefully if:
- Dremio Docker image is unavailable
- Container fails to start
- Container doesn't become ready within timeout

This prevents CI failure when Dremio image is not accessible while
still testing the integration when it is available.

* Revert "make Dremio tests gracefully skip if container unavailable"

This reverts commit e4b43e1447.

* ci(s3-tables-tests): remove Dremio job from automated CI

The Dremio integration tests are designed to test Dremio's integration
with SeaweedFS Iceberg catalog and should not gracefully skip. Since
the Dremio Docker image is not available in GitHub Actions environments,
running the tests in CI would cause failures.

The Dremio integration test code is preserved in test/s3tables/catalog_dremio/
for local development and testing. This can be run locally where Docker
and the Dremio image are available.

* test(s3tables): remove hardcoded port mapping from Dremio container

The tests use docker exec to communicate with Dremio, not direct port bindings.
Removing the fixed 9047:9047 port mapping prevents port conflicts and is unnecessary
for the test infrastructure.

* test(s3tables): fix runDremioSQL helper with proper JSON encoding and error handling

- Use encoding/json to properly marshal SQL payload instead of using unescaped %q
- Change t.Logf to t.Fatalf to properly fail tests on Dremio command errors
- Remove dependency on jq for response parsing
- Add parseDremioResponse helper to decode and validate Dremio JSON responses
- Fail tests immediately if response contains error messages

* test(s3tables): add proper assertions to CRUD operation tests

- Change test assertions from t.Logf to t.Fatalf for schema listing and table listing
- Validate COUNT(*) results in TestTableCRUD with proper row parsing and count validation
- Add assertions for COUNT, WHERE clause, and GROUP BY aggregation queries in TestDataInsertAndQuery
- Use parseDremioResponse to properly decode JSON responses instead of ignoring results

* test(s3tables): fix deterministic location tests and multi-level namespace handling

- Add assertions to verify table row counts in TestDeterministicTableLocation using parseDremioResponse
- Add DESCRIBE FORMATTED query to verify table location matches explicit S3 path
- Fix multi-level namespace handling to use separate quoted identifiers per level
- Change logging assertions to fatal assertions for proper test failure reporting
- Validate row counts in TestMultiLevelNamespace instead of just logging

* test(s3tables): fix shell quoting in runDremioSQL for proper JSON payload handling

Use %q for proper shell escaping of JSON payload to handle SQL statements
that contain single quotes or other special characters without breaking the command.

* fix(s3tables): correct master address format in weed shell command

Use colon separator between master port and gRPC port instead of dot.
The format should be host:port:grpcport, not host:port.grpcport

* fix(s3tables): pin Dremio Docker image to specific version

Use dremio/dremio:25.0.0 instead of latest to ensure test stability
and reproducibility across different environments and time periods

* chore: remove accidentally committed lock file

Remove .claude/scheduled_tasks.lock which is a local runtime artifact
that should not be version controlled

* docs: remove IMPLEMENTATION.md from Dremio test suite

The implementation details are better documented in README.md and inline code comments

* ci(s3-tables-tests): add Dremio Iceberg catalog integration test job

Add dremio-iceberg-catalog-tests job to the S3 Tables Integration Tests workflow.
The job runs the Dremio integration tests that validate SeaweedFS's Iceberg REST
Catalog integration with the Dremio SQL engine. Tests will skip gracefully if
Docker or the Dremio image is unavailable.

* fix(ci): increase Dremio test timeout to 30 minutes

* fix(s3tables): use stdin for Dremio SQL payload instead of shell wrapping

* test(s3tables): validate aggregation sums in CRUD test

* docs: add package documentation for Dremio catalog tests

* ci: trigger workflow run

* ci: remove concurrency lock to allow workflow execution

* ci: trigger fresh workflow run

* ci: add push trigger to workflow for better scheduling

* ci: create dedicated Dremio workflow to isolate test execution

* ci: add minimal test to debug workflow execution

* ci: fix workflow triggers to use pull_request only

* ci: remove diagnostic test workflow

* fix(s3tables): replace weed shell bucket create with REST API in Dremio tests

The Dremio integration test invoked `weed shell` with a malformed
`-master=host:port:grpcPort` flag (the correct separator is `.`), causing
the shell to spend ~65s reconnecting and never cleanly exit after running
`s3tables.bucket -create`. The shell process hung until the 30m test
deadline killed the suite. Switch to the same direct PUT /buckets call
the other catalog tests use, removing the shell dependency entirely.

* fix(s3tables): use weed shell with correct master format for Dremio bucket create

The HTTP /buckets endpoint requires SigV4 auth because the Dremio test
runs the S3 server with `-s3.config <iam>`. Switch back to `weed shell`
(which calls the filer directly over gRPC, bypassing S3 auth — same
approach as duckdb_oauth_test.go) and use the correct master flag
format `host:port.grpcPort` (dot, not colon). The earlier failure was
caused by passing `host:port:grpcPort`, which the shell silently
retried forever. Also wrap the command in a 60s context timeout so any
future hang surfaces a clear error instead of a 30m test deadline.

* fix(s3tables): switch Dremio container image to public dremio-oss tag

The previous image reference dremio/dremio:25.0.0 does not exist on
Docker Hub (the dremio/dremio repo is unlisted), so docker pull fails
with "pull access denied" both in the pre-pull step and at test runtime.
The public Dremio OSS image is dremio/dremio-oss and 25.0.0 is published
there. Update the test source and both workflow pre-pull commands.

* chore: gitignore .claude/ harness directory

* fix(s3tables): bind-mount Dremio dremio.conf as a single file, capture container logs

The previous run mounted the whole configDir over /opt/dremio/conf,
which masked Dremio's bundled log4j2.properties, dremio-env, and
distrib.conf. The JVM crashed within ~10s of starting, so every test
hit "container is not running". Mount only dremio.conf (read-only) so
the defaults stay in place.

Also include the container logs in waitForDremio's failure message —
without them the only signal was "container is not running", which
gave us no path forward when startup fails.

* fix(s3tables): drop invalid catalog.iceberg.* keys from dremio.conf

Dremio rejects catalog.iceberg.* properties at startup with
"Failure reading configuration file. The following properties were
invalid" — sources are not configured via dremio.conf, only via the
REST API at /apiv2/source. Write an empty {} so Dremio boots with
defaults; subsequent SQL queries will surface the missing-source
condition with a meaningful error instead of crashing the JVM.

A real Iceberg-source bringup (Dremio first-user bootstrap + POST
/apiv2/source) is the follow-up work to actually exercise queries.

* test(s3tables): skip Dremio catalog tests pending REST-API bootstrap

The Dremio container now starts cleanly, but every SQL call returns
HTTP 401 because the test never bootstraps an admin user, never logs
in to obtain a token, and never registers an Iceberg REST source. With
those three pieces missing the suite cannot exercise anything beyond
"can we boot a Dremio container" — useful coverage requires a proper
bootstrap helper.

Skip the seven test functions via a single requireDremioCatalogConfigured
helper that documents the missing bring-up steps, so CI is no longer red
on a half-built suite. The skip is loud (clear message in the helper)
so the follow-up work won't be forgotten.

* ci(s3tables): make Dremio tests opt-in

* test(s3tables): run Dremio catalog smoke test

* test(s3tables): align Dremio 25 catalog config
2026-05-02 11:31:27 -07:00
Chris Lu
1da091f798 ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up)

Six integration test packages had _test.go files but no GitHub workflow
running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy bug
(and the four cross-SSE copy bugs in #9281) ship undetected was an
instance of this same pattern. This change wires three of them into CI
and removes a fourth that was deadcode:

  test/multi_master/                    NEW workflow: multi-master-tests.yml
    - 3-node master raft cluster failover/recovery (5 tests, ~65s)
  test/testutil/                         (run alongside multi_master)
    - port-allocator regression test
  test/s3/etag/                          NEW workflow: s3-etag-acl-tests.yml
    - PutObject ETag format regression for #7768 (must be pure MD5 hex,
      not "<md5>-N" composite, for AWS Java SDK v2 compatibility)
  test/s3/acl/                           (same workflow as etag)
    - object-ACL behavior on versioned buckets

  test/s3/catalog_trino/                 DELETED (deadcode)
    - Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go
      from a 2024 commit that was never iterated, while the
      test/s3tables/ counterpart has been actively maintained (and IS
      in CI via s3-tables-tests.yml's trino-iceberg-catalog-tests job).

Both workflows trigger only on changes to relevant code paths and use the
existing simple "build weed → run go test" pattern (no per-test-dir
Makefile boilerplate). The S3 workflow starts a single `weed mini`
shared by etag and acl, which keeps the job under 2 minutes on a fresh
runner.

Two tests remain knowingly uncovered:

  test/s3/basic/  — order-dependent state across tests (TestListObjectV2
    expects a bucket created by an earlier test, etc.) and uses the
    deprecated aws-sdk-go v1. Treated as sample programs, not a
    regression suite. Fixing them is out of scope for this PR.

  test/s3/catalog_trino/  — see "DELETED" above.

Verified locally:
  - go test -v -timeout=8m ./test/multi_master/... ./test/testutil/...
    PASS  (5 multi_master + 1 testutil tests, 64s)
  - weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/...
    PASS  (8 etag + 5 acl tests, ~6s after server startup)

* ci: fix log-collector glob for multi-master tests (review feedback on #9283)

test/multi_master/cluster.go creates per-test temp dirs via
os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to
match that prefix. The previous version looked for MasterCluster* /
TestLeader* / TestTwoMasters* / TestAllMasters* which never matches —
the failure-artifact upload would have been empty on a real failure.

Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it
actually picks up the per-node master*.log files under
<baseDir>/logs/.

Found by coderabbitai review on PR #9283.
2026-04-29 10:06:59 -07:00
Chris Lu
82cf60a44f fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) (#9280)
* fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908)

The remaining failure mode in #8908 was that Docker Registry's blob
finalization (server-side Move via UploadPartCopy) silently corrupts
SSE-S3 multipart objects. Reproduces with `aws s3api upload-part-copy`
under bucket-default SSE-S3: the GET on the completed object returns
deterministic wrong bytes (correct length, same wrong SHA-256 across
runs). The metadata is mathematically self-consistent — every chunk's
stored IV equals `calculateIVWithOffset(baseIV_dst, partLocalOffset)` —
but the bytes on disk were encrypted with the SOURCE upload's key+baseIV.

Root cause:

  - `copyChunksForRange` (and `createDestinationChunk`) constructs new
    chunks for UploadPartCopy without copying `SseType` / `SseMetadata`,
    so destination chunks are written with `SseType=NONE`.

  - At completion, `completedMultipartChunk` (PR #9224's NONE→SSE_S3
    backfill, intended to recover from a different missing-metadata
    bug) sees those NONE chunks under an SSE-S3 multipart upload and
    backfills SSE-S3 metadata derived from the destination upload's
    baseIV. The chunk metadata is now internally consistent and the
    GET path applies decryption — but the bytes on disk are encrypted
    with the source upload's key, not the destination's. Decryption
    produces deterministic garbage. Docker Registry pulls then fail
    with "Digest did not match".

Fix: when either the source object or the destination multipart upload
has any SSE configured, take a slow-path UploadPartCopy that
(1) opens a plaintext reader of the source range — decrypting the
source's per-chunk SSE-S3 metadata if needed via a reused
`buildMultipartSSES3Reader`, and
(2) feeds that plaintext through `putToFiler`'s existing encryption
pipeline by staging the destination upload entry's SSE-S3/SSE-KMS
headers on a cloned request. Encryption then matches PutObjectPart's
contract: every part starts a fresh CTR stream from counter 0 with
`baseIV_dst`, and each internal chunk's metadata records
`calculateIVWithOffset(baseIV_dst, chunk.partLocalOffset)`.

The `non-SSE → non-SSE` case still takes the existing fast raw-byte
copy path — bytes on disk are plaintext on both sides, so chunk-level
metadata is irrelevant.

Cross-encryption from SSE-KMS / SSE-C sources is left as TODO — the
new path returns an explicit error rather than the previous silent
corruption. SSE-S3 (the user-reported case) round-trips correctly.

Tests:

  - test/s3/sse/s3_sse_uploadpartcopy_integration_test.go pins three
    UploadPartCopy shapes against bucket-default SSE-S3:
    * Docker-Registry-shape 32MB+tail (the user's exact 5-chunk /
      2-part metadata layout)
    * single full-object UploadPartCopy
    * many small range copies
    Each round-trips SHA-256.

  - test/s3/sse/s3_sse_concurrent_repro_test.go covers the parallel
    multipart-upload shape from the user report (5 blobs in parallel,
    full GET and chunked range GET both hash-checked) — pre-existing
    coverage; added here as a regression sentinel.

* test(s3-sse): rename UploadPartCopy regression test so CI matches it

The CI workflow .github/workflows/s3-sse-tests.yml dispatches on the
TEST_PATTERN ".*Multipart.*Integration" — i.e. the test name must contain
both "Multipart" and "Integration" for CI to run it. The previous name
TestSSES3UploadPartCopyIntegration had only "Integration"; "UploadPart"
isn't "Multipart". Rename to TestSSES3MultipartUploadPartCopyIntegration
so the regression test actually runs in CI rather than only locally.

* fix(s3api): map unsupported UploadPartCopy SSE source to 501, not 500 (review feedback on #9280)

openSourcePlaintextReader explicitly rejects SSE-KMS and SSE-C sources
(SSE-S3 is the only one wired up in this slow path so far). Earlier
the caller blanket-mapped that to ErrInternalError, which collapses
"this shape isn't implemented yet" into the same 500 response a
real server failure would produce. Clients can no longer tell whether
they hit a feature gap or a bug.

Introduce a sentinel errCopySourceSSEUnsupported and have
copyObjectPartViaReencryption errors.Is-check it; on match, return
ErrNotImplemented (501) instead of ErrInternalError (500). Other
failures still map to 500.

Found by coderabbitai review on PR #9280.

* fix(s3api): UploadPartCopy must fail with NoSuchUpload when upload entry is missing (review feedback on #9280)

CopyObjectPartHandler's earlier checkUploadId call only verifies that
the uploadID's hash prefix matches dstObject; it does not prove the
upload directory exists in the filer. The previous logic silently
swallowed filer_pb.ErrNotFound from getEntry(uploadDir) and fell
through with uploadEntry=nil, which then skipped the destination
SSE check and could route a plain-source copy through the raw-byte
fast path even though the destination's encryption state is unknown.

Treat ErrNotFound as ErrNoSuchUpload so the client sees the right
status, matching the AWS S3 contract for UploadPartCopy on a
non-existent upload.

Found by coderabbitai review on PR #9280.

* feat(s3api): set SSE response headers on UploadPartCopy slow path (review feedback on #9280)

PutObjectPartHandler writes x-amz-server-side-encryption (and the KMS
key-id header for SSE-KMS) on every successful part response so clients
can confirm the destination's encryption state. The new UploadPartCopy
slow path was missing this — it returned only the ETag in the response
body and no SSE response headers.

Plumb putToFiler's SSEResponseMetadata back through
copyObjectPartViaReencryption to the handler, then call
setSSEResponseHeaders before writing the XML response, matching the
PutObjectPart contract.

Found by gemini-code-assist review on PR #9280.

* fix(s3api): map transient filer errors on UploadPartCopy upload-entry fetch to 503 (review feedback on #9280)

Earlier non-ErrNotFound errors from getEntry(uploadDir, uploadID) all
returned 500 InternalError, which most SDKs treat as fatal — even
though a transient filer outage (gRPC Unavailable, leader election in
flight, deadline exceeded) is exactly the kind of failure SDK retry
logic is supposed to recover from.

Add an isTransientFilerError helper that recognises:
  - context.DeadlineExceeded / context.Canceled
  - gRPC codes.Unavailable, DeadlineExceeded, ResourceExhausted, Aborted

When the upload-entry fetch fails for one of those reasons, return
503 ServiceUnavailable so the client retries; everything else still
maps to 500. Log line now also carries dstObject (in addition to
dstBucket and uploadID) to make incident triage easier.

Found by gemini-code-assist review on PR #9280.
2026-04-29 09:46:44 -07:00
Chris Lu
02574314f6 test(s3): force-drop collection after deleteBucket in tagging/versioning/cors/copying (#9270)
* test(s3): force-drop collection after deleteBucket across tagging/versioning/cors/copying

Each test creates a unique bucket (= new SeaweedFS collection) and the master's
warm-create issues a 7-volume grow batch. The S3 DeleteBucket-driven collection
sweep snapshots the layout once, but in-flight `volume_grow` requests keep
registering volumes after the snapshot, leaking 1-3 volumes per bucket. On a
single `weed mini` data node with the auto-derived volume cap, those leaks pile
up fast and every subsequent PutObject 500s with "Not enough data nodes found".

Mirror the retention-suite fix (commits ac3a756d, 363d5caa) into the four other
suites that share the same shape: defer a master-side /col/delete after each
bucket teardown, and sweep stale prefix-matching buckets before every new
createBucket. Each suite gets its own MASTER_ENDPOINT default plus the
allTestBucketPrefixes / cleanupAllTestBuckets / cleanupLeftoverTestBuckets /
forceDeleteCollection helpers.

Skipped: iam (separate framework + v1 SDK; already passes
-master.volumeSizeLimitMB=100), sse (different cleanup signature; docker-compose
already passes -volumeSizeLimitMB=50), policy (its TestCluster already uses
-master.volumeSizeLimitMB=32 and tears the whole cluster down). Suites under 5
tests cannot exhaust the cap and were left untouched.

* test(s3): default master endpoint to 127.0.0.1 to avoid IPv6 resolution

Mirror the explicit IPv4 default already used by test/s3/copying. On hosts
where `localhost` resolves to ::1 first the master HTTP listener (bound on
0.0.0.0) is unreachable, so the new force-delete-collection helper would silently
skip cleanup. Pin the default to 127.0.0.1 in tagging/cors/versioning; the
MASTER_ENDPOINT env var still wins when set.

* test(s3): skip buckets newer than this process in leftover-bucket sweep

cleanupAllTestBuckets/cleanupTestBuckets used to delete every bucket whose name
matched the suite's prefix. With shared S3/master endpoints, a second concurrent
`go test` run could have its live buckets torn down mid-test by the first run's
sweep.

Capture testRunStart at package init (with a 1-minute backdate for clock skew)
and skip any bucket whose CreationDate is newer than that. Stale buckets from
panicked or interrupted prior runs (the original target of the sweep) still get
collected because they were created before this process started.

* test(s3-copying): sweep stale prefix buckets from createBucket, not just from a few callers

The s3-copying suite already had cleanupTestBuckets that walks every bucket and
drops the test-copying-* prefix matches, but only four tests in the file invoke
it; the rest go straight to createBucket. So a panicked or interrupted prior run
could leak buckets that survive into the next run and exhaust the data node's
volume slots before any of the prefix-sweeping tests get a chance to run.

Hoist the sweep into createBucket so every test that creates a bucket starts on
a clean slate. The per-process CreationDate filter from the prior commit keeps
this safe under concurrent runs.

* test(s3): scope leftover-bucket sweep to this run via runID marker

The CreationDate filter only protected against runs that started later than this
process. A different `go test` against the same S3/master endpoints that started
*earlier* and is still active has buckets older than testRunStart, so the sweep
would still tear them down mid-test.

Replace the time window with a per-process runID baked into bucket names: every
bucket this run creates gets `r{runID}-` after the suite's BucketPrefix, and the
sweep only matches that owned subset. Concurrent runs each carry their own
runID and never see each other's buckets.

Trade-off: buckets left behind by a crashed prior run carry a different runID
and won't be cleaned up by this sweep. That recovery path now belongs to
`make clean` / data-dir wipe, which is what CI already does between jobs.
Fixed-name versioning buckets (e.g. test-versioning-directories) bypass
getNewBucketName and so also bypass this sweep — they are short-lived and
handled by their own deferred deleteBucket.

* test(s3): drop cleanupLeftoverTestBuckets wrapper, call cleanupAllTestBuckets directly

cleanupLeftoverTestBuckets was a one-line forwarder kept only as a semantic name
at the createBucket call site. Inline the call and update the doc comments to
point at the actual implementation. tagging/cors/versioning all had the wrapper;
copying never did.
2026-04-28 22:13:42 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Chris Lu
363d5caa85 test(s3-retention): purge stale buckets before each create to avoid volume exhaustion
The WORM suite creates one bucket per test, each backed by ~3 reserved
volumes on the data node. With ~30 tests and the default `weed mini`
volume cap, the data node runs out of slots midway through the run and
every PutObject after that fails with InternalError.

Hook a sweep of every test-prefix bucket into the create helpers so a
panicked or interrupted prior test cannot leak buckets into the next.
2026-04-27 22:14:20 -07:00
Chris Lu
d92c5e057a test(iceberg): cross-engine regression coverage for deterministic table locations (#9074) (#9253)
* test(trino): regression for unique-table-location=false (#9074)

With #9246's namespace-location property, Trino's REST catalog can
resolve table locations even when the connector is configured with
`iceberg.unique-table-location=false`, and CREATE/CTAS lands at the
deterministic <namespace-location>/<tableName> path with no
UUID-suffixed sibling. Lock that in:

- writeTrinoConfig now parameterizes the unique-table-location flag via
  a withDeterministicTableLocation() option.
- setupTrinoTest forwards config options.
- TestTrinoDeterministicLocationCTAS exercises a fresh CREATE TABLE +
  CTAS with the flag flipped off and asserts the on-disk layout has
  no UUID-suffixed sibling under the namespace dir, proving each table
  occupies a single dir.

Refs #9074

* test(spark): regression for CTAS without explicit location (#9074)

iceberg-spark has no equivalent of Trino's unique-table-location flag —
its REST catalog interactions always produce the deterministic
<namespace-location>/<tableName> path. Without #9246's namespace-location
property, Spark cannot resolve a table location for a CREATE TABLE that
omits an explicit LOCATION clause; with it, the operation succeeds and
the table lands at the expected single-dir-per-table layout.

TestSparkDeterministicLocationCTAS walks the same scenario as the Trino
test: CREATE TABLE without LOCATION, INSERT, CTAS, SELECT count, then
asserts via S3 ListObjectsV2 that no UUID-suffixed sibling directory
appears under the namespace.

Refs #9074

* test(duckdb): read table at deterministic location via REST catalog (#9074)

DuckDB's iceberg extension is a read-only consumer in this flow — there
is no client-side UUID-suffixing toggle to test. The relevant question
post-#9246 is whether DuckDB can ATTACH the REST catalog and resolve a
table at a deterministic <bucket>/<namespace>/<tableName> path produced
by writers that don't suffix UUIDs (iceberg-spark, pyiceberg, Trino with
unique-table-location=false).

TestDuckDBDeterministicLocationRead creates a namespace + minimal table
via direct REST API calls (so no client-side UUID is added), confirms
the catalog returns a deterministic location URL, then runs DuckDB
through ATTACH ... TYPE 'ICEBERG' and DESCRIBE on the table. Asserting
DESCRIBE succeeds proves DuckDB walked the catalog → metadata → schema
chain against the deterministic on-disk path.

The test skips gracefully when the DuckDB image lacks the iceberg
extension or the ATTACH-iceberg syntax, so older base images don't fail
the suite.

Refs #9074
2026-04-27 20:14:48 -07:00
Chris Lu
fe50da4934 test(fuse): stream verify-phase diagnostics from writeback stress test
The 45m suite alarm fires on TestWritebackCacheStressSmallFiles with no
output from the test, since t.Logf is buffered until the test completes
and the alarm panic skips that flush. Add streaming stderr progress, an
explicit verify-phase budget that t.Fatalf's with a goroutine dump on
overrun, and per-retry/per-failure logging so the next hang shows which
file(s) the mount could not read back.
2026-04-27 19:44:26 -07:00
Chris Lu
f50917224a fix(iceberg): default namespace location so fresh CTAS does not race metadata write (#9074) (#9246)
* fix(iceberg): advertise default namespace location for REST clients

Trino's REST catalog has two code paths for CREATE TABLE depending on
whether a table location can be resolved before the catalog call:

    // TrinoRestCatalog.newCreateTableTransaction (Trino 479)
    if (location.isEmpty()) {
        return tableBuilder.create().newTransaction();   // EAGER: REST POST now
    }
    return tableBuilder.withLocation(...).createTransaction(); // DEFERRED

`tableLocation` resolves to null when the user does not pass
`location = '...'` AND `defaultTableLocation` returns null. The latter
happens whenever the namespace's `loadNamespaceMetadata` response has no
`location` property — and our handler returned exactly that.

In the eager branch Trino calls REST POST /v1/.../tables immediately, so
our handleCreateTable persists `<location>/metadata/v1.metadata.json` to
the filer. Trino's IcebergMetadata.beginCreateTable then runs
`fileSystem.listFiles(location).hasNext()` on the same path, finds the
metadata file we just wrote, and throws "Cannot create a table on a
non-empty location" — even on a brand-new schema and table name.

Synthesize a default `location` of `s3://<bucket>/<flattened-namespace>`
on Get/Create namespace responses when one is not stored. With a non-
null `defaultTableLocation`, Trino takes the deferred branch, picks a
unique `<namespace-location>/<table>-<UUID>` path (UUID added by the
standard `iceberg.unique-table-location=true` setting), and the empty-
location check passes. The actual REST POST is deferred to commit time,
so our metadata write lands alongside the data files Trino has already
produced.

Existing namespaces with an explicit `location` property are untouched —
the synthesis only kicks in when the property is absent.

Refs #9074

* test(trino): regression for fresh CREATE TABLE without explicit location

Exercises the follow-up scenario reported on issue #9074: a CREATE TABLE
on a brand-new schema and brand-new table name with NO `location = '...'`
clause, followed by a CTAS on top — exactly the SQL pattern from the
report. Before advertising a default namespace location, the first
CREATE TABLE failed with

    Cannot create a table on a non-empty location:
    s3://iceberg-tables/<schema>/<table>, set
    'iceberg.unique-table-location=true' in your Iceberg catalog
    properties to use unique table locations for every table.

even though the bucket was genuinely empty. The bug surfaced because our
handleCreateTable persists `<location>/metadata/v1.metadata.json` during
Trino's eager-create branch, and Trino's post-create listFiles
emptiness check then trips on the metadata file we just wrote.

The test asserts the CTAS succeeds AND the resulting table contains the
source rows, since the reporter saw the table get created with empty
data when the query failed.

Refs #9074
2026-04-27 16:37:33 -07:00
Chris Lu
ac3a756dae test(s3-retention): force-drop collection after deleteBucket to free volumes
COMPLIANCE-mode retention leaves objects that BypassGovernanceRetention cannot
clear, so the test's DeleteBucket keeps returning BucketNotEmpty and the
underlying SeaweedFS collection (with its 7 reserved volumes) leaks. After a
few leaks on the single-node `weed mini` server, the master logs "Not enough
data nodes found" and every subsequent PutObject 500s, timing the suite out.

Call the master's /col/delete admin endpoint from deleteBucket so the
collection's volumes are reclaimed even when S3-level cleanup is blocked.
2026-04-27 12:12:36 -07:00
Chris Lu
6cbcdf488c chore(mount,fuse-test): diagnostics for FUSE ConcurrentReadWrite ENOENT flake
PR #9230 attempt 1 hit an intermittent
TestConcurrentFileOperations/ConcurrentReadWrite failure where stat
returned ENOENT for a path all writers had just succeeded against, and
the captured mount.log carried no signal about which layer dropped the
entry because the relevant lookup logged at V(4).

Two diagnostic-only changes (no behavior change on the happy path):

- weed/mount/weedfs.go: in lookupEntry, when filer GetEntry returns
  ErrNotFound for a path whose inode is still tracked locally with no
  in-flight create or flush, log Warningf with inode + dirtyHandle +
  pendingFlush + localCache + dirCached. This surfaces layer-by-layer
  state at the moment of the suspicious ENOENT.

- test/fuse_integration/framework_test.go: on AssertFileExists failure,
  dump five 100ms-spaced stat retries, a parent ReadDir, and a direct
  O_RDONLY open before failing. Triangulates kernel dentry caching vs
  mount lookup vs filer state.
2026-04-26 16:57:37 -07:00
Chris Lu
4f628ff4e5 fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908) (#9228)
* fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908)

buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY
chunk upfront, then walked them with io.MultiReader. For a multipart
SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob
with 25+ chunks), N volume-server bodies sat live at once; chunks
1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent
load the volume server's keep-alive logic closed those idle responses
mid-flight, and the S3 client saw `unexpected EOF` partway through the
GET. Truncated bytes hash to the wrong SHA-256, which is exactly the
"Digest did not match" symptom Docker Registry reports in #8908 (and
which persisted even after the per-chunk metadata fix in #9211 and the
completion backfill in #9224).

Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk,
wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for
the SSE-specific decryption setup. Per-chunk metadata is still
validated UPFRONT so a malformed chunk fails fast without opening any
HTTP connection -- the eager validation contract callers and tests
rely on is preserved. The volume-server GET and the SSE-specific
decrypt wrap, however, fire LAZILY: at most one chunk body is live at
any time, regardless of object size.

This commit applies the new pattern to buildMultipartSSES3Reader only;
the SSE-KMS and SSE-C multipart readers retain their eager form for
now and will be migrated in follow-up commits, since the same shape
exists there too.

Tests:
  - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract:
    zero chunks opened at construction, peak liveness == 1, all closed
    after drain.
  - TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch
    (replaces ClosesAppendedOnError) asserts a malformed chunk in
    position N causes zero fetches for chunks 0..N -- the previous
    test pinned a weaker contract (cleanup after eager open).
  - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same
    reason: the fetch callback must NOT be invoked at all on a bad-IV
    chunk.
  - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts
    encrypted the way putToFiler writes them (shared DEK + baseIV,
    partOffset=0, post-completion global offsets) and walks them
    through buildMultipartSSES3Reader.

* fix(s3api): stream multipart SSE-KMS chunks lazily

Apply the same fix as the previous commit to
createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata
is validated upfront, but volume-server GETs fire lazily through
lazyMultipartChunkReader. At most one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The wire format on disk is unchanged (same per-chunk metadata, same
encrypted bytes, same object Extended attributes). Existing SSE-KMS
multipart objects read back identically -- only when the volume-server
GETs fire changes.

* fix(s3api): stream multipart SSE-C chunks lazily

Apply the same fix as the previous two commits to
createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is
validated upfront (IV decode, IV length check, non-negative
PartOffset), but the volume-server GET and CreateSSECDecryptedReader-
WithOffset wrap fire lazily through lazyMultipartChunkReader. At most
one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-C multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The pre-existing TODO note about CopyObject SSE-C PartOffset handling
is preserved verbatim. The wire format on disk is unchanged (same
per-chunk metadata, same encrypted bytes); existing SSE-C multipart
objects read back identically.

After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS,
SSE-C) share lazyMultipartChunkReader as their streaming engine.

* test(s3): add Docker Registry-shape multipart SSE-S3 GET regression

Pin the end-to-end fix for #8908 with a test that mirrors what Docker
Registry actually does on pull: a 25-part * 5MB upload with bucket-
default SSE-S3, then a full GET, then SHA-256 over the streamed body
must match SHA-256 over the uploaded bytes.

The eager-multipart-reader bug was specifically a streaming truncation
under load: the response status was 200 with a Content-Length matching
the object size, but the body short-circuited mid-stream because
later chunks' volume-server connections had already been closed by
keepalive. The hash check is the symptom Docker Registry surfaces
("Digest did not match"), so this is the most faithful regression we
can pin without spinning up a registry.

uploadAndVerifyMultipartSSEObject already byte-compares the GET body,
but hashing on top is intentionally explicit -- it documents WHY the
test exists, and matches the failure mode reported in the issue.

* test(s3): add range-read coverage matrix across SSE modes and sizes

Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB)
single-chunk objects, with one ad-hoc range case per SSE mode and one
129-byte boundary-crossing case in TestSSEMultipartUploadIntegration.
Nothing exercised:

  - Range reads on single-PUT objects whose content crosses the 8MB
    internal chunk boundary (medium size class).
  - Range reads on multipart objects whose parts each span multiple
    internal chunks (large size class) -- the shape #8908 originally
    surfaced for full-object GETs and the most likely site of any
    future regression in per-chunk IV / PartOffset plumbing for
    partial reads.
  - A consistent range-pattern set applied uniformly across SSE modes,
    so any divergence between modes (SSE-C uses random IV + PartOffset;
    SSE-S3/KMS use base IV + offset) is comparable at a glance.

TestSSERangeReadCoverageMatrix introduces a parameterized matrix:

  modes:     no_sse, sse_c, sse_kms, sse_s3
  sizes:     small (256KB single chunk),
             medium (12MB single PUT crossing one internal boundary),
             large (5x9MB multipart, ~10 internal chunks, every part
                    itself spans an 8MB boundary)
  ranges:    single byte at 0, prefix 512B, single byte at last,
             suffix bytes=-100, open-ended bytes=N-, whole object,
             AES-block boundary 15-31, mid straddling one internal
             boundary (medium+large), mid spanning many internal
             boundaries (large only)

Per case it asserts: body bytes equal the expected slice, Content-Length
matches the range length, Content-Range matches start-end/total, and the
SSE response headers match the mode.

The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's
the remaining sse_kms subtests with a clear reason if the local server
has no KMS provider configured -- the default `weed mini` setup lacks
one; the Makefile target `test-with-kms` provides one via OpenBao. Other
modes always run.

Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against
weed mini, sse_kms cleanly skipped.

* test(s3): conform new test names to TestSSE*Integration so CI runs them

The two tests added in the previous commits had names that did NOT match
the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml
use to discover SSE integration tests:

  - test/s3/sse/Makefile `test` target:           TestSSE.*Integration
  - test/s3/sse/Makefile `test-multipart`:        TestSSEMultipartUploadIntegration
  - .github/workflows/s3-sse-tests.yml:           ...|.*Multipart.*Integration|.*RangeRequestsServerBehavior

Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and
the Docker-Registry-shape multipart regression in
TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible
to CI even though the underlying test setup (start-seaweedfs-ci using
s3-config-template.json with the embedded `local` KMS provider) already
has SSE-KMS configured.

Renames:

  TestSSERangeReadCoverageMatrix              -> TestSSERangeReadIntegration
  TestSSES3MultipartManyChunks_...            -> TestSSEMultipartManyChunksIntegration

Both names now match `TestSSE.*Integration` (Makefile `test` target) and
TestSSEMultipartManyChunksIntegration additionally matches
`.*Multipart.*Integration` (CI's comprehensive subset). No behavior
change; only the function names move.

Verified locally against `weed mini` with s3-config-template.json:
TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes
(none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns,
all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only
fires for ad-hoc local setups that don't load any KMS provider; the
project's standard test setup loads the local provider, so CI has full
SSE-KMS range coverage.

* fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch

Addresses CodeRabbit review on PR #9228: in
createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata
was deserialized in the prep loop but the IV length was only validated
later, inside CreateSSEKMSDecryptedReader, which runs from the wrap
closure -- AFTER the chunk's volume-server fetch has already started.
That weakens the new "reject malformed chunks before any fetch" contract
for SSE-KMS specifically: a chunk with a missing/short/long IV would
fire its HTTP GET, then fail mid-stream during decrypt.

The fix moves the existing ValidateIV check into the prep loop, matching
the SSE-S3 and SSE-C paths.

Drive-by: extract the SSE-KMS prep loop into a free
buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader,
so the new contract is unit-testable without an S3ApiServer. The
exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a
thin caller, so behavior for production callers is unchanged.

New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract:

  - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers
    missing IV, empty IV, short IV, long IV. Each case asserts both
    that an error is returned AND that the fetch callback is never
    invoked.
  - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch
    pins the analogous behavior when SseMetadata is nil on a chunk in
    position N: chunks 0..N-1 must not be fetched (the earlier eager
    implementation depended on a closeAppendedReaders cleanup path; the
    new contract is stronger -- nothing is opened in the first place).
  - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch
    covers the JSON-unmarshal failure branch.
  - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the
    documented sort-by-offset contract by recording the order in which
    fetch is invoked.

All four pass under `go test ./weed/s3api/`. Existing weed/s3api unit
suite + the SSE integration suite (with the local KMS provider enabled
via s3-config-template.json) continue to pass.

* test(s3): address CodeRabbit nitpicks on range coverage matrix

Three small follow-ups on the range-read coverage matrix from the
previous commit, per CodeRabbit nitpicks on PR #9228:

1. Promote the body-length check from `assert.Equal` to `require.Equal`
   so a truncation regression -- the canonical #8908 failure mode --
   aborts the subtest immediately. Previously the assertion logged a
   length mismatch and then `assertDataEqual` ran on differently-sized
   slices, producing a noisy byte-diff on top of the actual symptom.
   The redundant trailing `t.Fatalf` block becomes dead and is removed.

2. Broaden the SSE-KMS probe-skip heuristic. The probe previously
   produced the friendly "KMS provider not configured" message only
   for 5xx responses; KMS-misconfig surfaces also include 501
   NotImplemented, 4xx KMS.NotConfigured, and error messages
   containing "KMS.NotConfigured" / "NotImplemented" /
   "not configured". The behaviour change is purely cosmetic (the
   caller t.Skip's on any non-empty reason either way) but the new
   diagnostic is more useful in CI logs.

3. Add `t.Parallel()` at the mode and size-class levels of the matrix.
   Each (mode, size) writes an independent object key under the shared
   bucket, with no cross-talk, so parallel execution is safe. Local
   wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%);
   the savings scale with chunk count and CI machine concurrency.

Verified locally against `weed mini` with s3-config-template.json:
  - go test ./weed/s3api/ -count=1                   PASS
  - TestSSERangeReadIntegration -v                   112 PASS, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.           PASS

* fix(s3api): tighten lazy reader error path; unify SSE IV validation

Three CodeRabbit nitpicks on PR #9228:

1. lazyMultipartChunkReader: mark finished on non-EOF Read errors

   The Read loop's three earlier failure paths (chunk index past end,
   fetch error, wrap error) all set l.finished = true before returning.
   The non-EOF Read path -- where l.current.Read itself errors mid-chunk
   -- did not, leaving l.current/l.closer set and l.finished = false. A
   caller that retried Read after an error would re-enter the same
   broken stream instead of advancing or giving up. Set l.finished =
   true on non-EOF Read error so post-error state is consistent across
   all four failure sites; Close() (which the GetObjectHandler defers)
   still releases the chunk body.

2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths

   The previous commit moved SSE-KMS to the shared ValidateIV helper
   but left SSE-S3 and SSE-C with bespoke inline `len(...) !=
   AESBlockSize` checks. All three are enforcing the same invariant;
   inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to
   ValidateIV too, with the same `<algo> chunk <fileId> IV` name
   convention. Error message wording shifts from "<algo> chunk X has
   invalid IV length N (expected 16)" to ValidateIV's "invalid <algo>
   chunk X IV length: expected 16 bytes, got N". The substring
   "IV length" is preserved across both, so the existing
   TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion
   is loosened to match either form.

3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering

   The test previously drove Read() to observe fetch-call order, but
   CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap
   the encrypted DEK -- unavailable in unit tests -- so the wrap
   closure failed on the first chunk and only one fetch was ever
   recorded. The test asserted only fetchOrder[0] == "c0", which is
   weaker than the comment promised.

   Switch to a static check: type-assert the returned reader to
   *lazyMultipartChunkReader (same package so unexported fields are
   accessible) and inspect the prepared chunks slice directly. This
   pins the entire [c0, c1, c2] sort order in one place, doesn't
   depend on KMS, and runs in zero fetch calls. The fetch closure
   now asserts it is never invoked during preparation.

All weed/s3api unit tests pass; integration suite (with KMS provider
configured via s3-config-template.json) passes.

* test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe

Two CodeRabbit comments on PR #9228, both about
test/s3/sse/s3_sse_range_coverage_test.go:

1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration

   The test creates one bucket up front, then runs subtests that call
   t.Parallel() at the mode and size levels (added in 058cbf27 to cut
   wall time). t.Parallel() pauses each subtest and yields back to the
   parent. The parent's for loop finishes scheduling, the function
   returns, and the deferred cleanupTestBucket fires -- BEFORE any
   parallel subtest body has executed. The bucket gets deleted out
   from under the parallel subtests, which then race the cleanup and
   either fail with NoSuchBucket or, depending on lazy-deletion
   behaviour on the server side, mask other regressions because
   chunks happen to still be readable for a brief window.

   The local matrix passing prior to this commit was a server-side
   coincidence; the t.Cleanup contract is the right one for parent
   tests with parallel children, and switching to it is a one-line
   change. t.Cleanup runs after the test AND all its (parallel)
   subtests complete, so the bucket survives until every leaf
   subtest is done.

2. MINOR: tighten the SSE-KMS probe-skip heuristic

   The previous broadening (058cbf27) treated `code == 400` as
   "KMS provider not configured", on the theory that some servers
   return 4xx for KMS misconfig. That is too aggressive: a real
   misconfiguration in the SSE-KMS test request itself (bad keyID
   format, missing header) ALSO surfaces as a 400, and would
   silently t.Skip the SSE-KMS subtree in CI -- which is exactly
   the integration coverage the new TestSSERangeReadIntegration is
   supposed to add. Drop the 400 branch (and the redundant 501
   match, since 501 >= 500 already covers it). Genuine
   "KMS.NotConfigured" / "NotImplemented" responses are still
   recognised via the string-match block immediately below,
   regardless of status code, so the friendly skip message survives
   for the cases where it actually applies.

Verified locally against `weed mini` with s3-config-template.json:

  - go test ./weed/s3api/                     PASS
  - TestSSERangeReadIntegration -v            113 PASS lines, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.    PASS
2026-04-26 16:31:42 -07:00
Chris Lu
525900dfe4 fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224)
* fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion

When a part of an SSE-S3 multipart upload lands with SseType=NONE on
its chunks (e.g. a transient failure to apply SSE-S3 setup in
PutObjectPart), the completed object inherits NONE-tagged chunks and
detectPrimarySSEType then misses the chunked SSE-S3 encryption. The
read path falls through to the unencrypted serve and GET returns
ciphertext, producing the SHA mismatch reported in #8908.

Recover at completion using the base IV and key data the upload
directory recorded at CreateMultipartUpload:

  - extractMultipartSSES3Info validates upload-entry metadata up
    front and hard-fails completion if the base IV or key data are
    malformed; serializing chunk metadata we then could not decrypt
    is worse than rejecting the upload.
  - completedMultipartChunk re-derives a per-chunk IV from baseIV +
    chunk.Offset (matching what putToFiler would have written) and
    serializes per-chunk SSE-S3 metadata when the chunk has no tag.
    Existing per-chunk metadata is left alone; we cannot recover an
    already-derived IV from the upload-entry alone.

The IV formula intentionally has no partNumber term: putToFiler
hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption
for every part, so each chunk's encryption IV is
calculateIVWithOffset(baseIV, chunk.Offset_part_local).
PartOffsetMultiplier is defined in s3_constants but is not consumed
by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier
+ chunk.Offset would produce IVs that fail to decrypt the bytes on
disk - a stronger failure mode than the bug being fixed. Tests pin
this:

  - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext
    runs the round trip across the encryption boundary: encrypt
    parts with CreateSSES3EncryptedReaderWithBaseIV (the call
    putToFiler uses), drop chunk metadata to reproduce #8908,
    backfill, decrypt with backfilled IV, assert plaintext intact.
  - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula
    constructs the IV the partNumber formula would produce and
    shows it does not decrypt the actual ciphertext.

This commit covers the chunk-level recovery only. The companion
fix for the object-level Extended attributes (SeaweedFSSSES3Key /
X-Amz-Server-Side-Encryption) follows separately.

* fix(s3api): backfill canonical SSE-S3 attributes onto multipart object

The previous commit ensures every chunk of an SSE-S3 multipart upload
carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct
read path can decrypt. The completed object's Extended map can still
miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal
look at:

  - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header
    detectPrimarySSEType reads on inline / small-object reads)
  - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by
    IsSSES3EncryptedInternal and by the read-path key lookup)

When a part of the upload was written by a path that did not set
those (the same #8908 race that produced the NONE chunks),
copySSEHeadersFromFirstPart finds nothing to copy and the final entry
ends up with only the multipart-init keys (SeaweedFSSSES3Encryption /
BaseIV / KeyData). The read path then mis-detects the object as
unencrypted.

applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair
from the multipart-init metadata in all three completion paths
(versioned, suspended, non-versioned), only when the keys are missing
so a healthy first part still wins. extractMultipartSSES3Info already
ran in prepareMultipartCompletionState, so the data is reused without
re-decoding.

Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill,
do-not-clobber, and nil-info no-op cases.

* fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt

decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV
(calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the
adjusted IV to CreateSSEKMSDecryptedReader, which itself runs
calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The
offset was being applied twice for any chunk with a non-zero
ChunkOffset, corrupting the keystream for range reads that cross
multipart chunk boundaries.

Pass the raw SSE-KMS key (with base IV and the original ChunkOffset
field) into CreateSSEKMSDecryptedReader so the offset is applied
exactly once, and remove the now-dead intra-block skip that was
compensating for the double adjustment.

Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment
that decrypts the same ciphertext with a deliberately double-adjusted
IV and asserts the output is corrupted, so any regression that
re-introduces the double application fails the unit test.

* test(s3): cover multipart SSE across chunk-spanning parts and ranges

Adds an integration subtest "Multipart Parts Larger Than Internal
Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that
exercises the end-to-end S3 path for the bugs fixed in this branch:

  - Two-part multipart upload with each part larger than the 8MB
    internal SeaweedFS chunk, so each part itself spans multiple
    underlying chunks.
  - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default
    SSE-S3 - the four paths multipart parts can take through the SSE
    pipeline.
  - Each subtest does a full GET (verifying every byte and the
    response Content-Length / SSE response headers) plus a 129-byte
    range read straddling the 8MB internal chunk boundary, which is
    the path that produced the SSE-KMS double-IV corruption (fix in
    the previous commit) and the SSE-S3 chunk-tag loss (fix in the
    earlier commits).

Factored the request shape behind multipartSSEOptions /
uploadAndVerifyMultipartSSEObject so all four SSE flavors share the
same upload+verify code; only the SSE-specific input/output
configuration differs per subtest.

* test(s3): abort orphan multipart uploads on test failure

Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The
helper used require.NoError after CreateMultipartUpload, UploadPart
and CompleteMultipartUpload, so a failure in any of those (or in the
later GET / range read on a still-incomplete upload) called t.Fatal
without aborting the in-flight MPU, leaving an orphan upload in the
bucket. Harmless in CI where the data dir is wiped on shutdown, but a
real annoyance when iterating locally and a textbook AWS S3 caveat in
production.

Register a t.Cleanup that calls AbortMultipartUpload unless a
"completed" flag was set right after a successful
CompleteMultipartUpload. Use context.Background for the abort call
since the parent ctx may already be cancelled at cleanup time, and
t.Logf the abort error rather than failing the test so the original
failure remains visible in the run output.
2026-04-25 23:06:37 -07:00
Chris Lu
a14cbc176b debug(kafka): add restart flake diagnostics 2026-04-24 15:02:07 -07:00
Chris Lu
a0be40e070 Merge branch 'master' of https://github.com/seaweedfs/seaweedfs 2026-04-23 16:25:12 -07:00
Chris Lu
b94ad82472 fix(test): stabilize ConcurrentLockContention; warn on coherence drift
TestPosixFileLocking/ConcurrentLockContention failed in CI (run
24857323067) with ENOENT when re-opening the file after all 8 workers
had successfully written and closed. The 20s openWithRetry budget was
exhausted, pointing at a real but unproven metaCache/parent-cache
coherence issue in the mount under bursts of concurrent Release.

Test: hold the initial fd open for the whole subtest; use it for the
post-workers Sync() and the verification read. Workers still exercise
the concurrent-flock invariant and per-record write correctness; the
re-open path is no longer load-bearing. On Eventually failure, dump
ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future
recurrence has state to debug from. Drop the darwin-only ENOENT
t.Skip branches that hid this same flake.

Mount: in weedfs.lookupEntry, when returning ENOENT from the
"parent cached but child missing" branch, log at Warningf instead of
V(4) when the kernel is still tracking this path's inode. That
combination is the smoking-gun signal for cache drift and is rare
enough in normal use not to spam the log.
2026-04-23 15:57:35 -07:00
dependabot[bot]
cd5004cfbd build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 in /test/kafka (#9204)
build(deps): bump github.com/Azure/go-ntlmssp in /test/kafka

Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1.
- [Release notes](https://github.com/Azure/go-ntlmssp/releases)
- [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1)

---
updated-dependencies:
- dependency-name: github.com/Azure/go-ntlmssp
  dependency-version: 0.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-23 15:02:28 -07:00
Chris Lu
f438cc3544 fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard

ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.

Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.

* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard

ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.

The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.

Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
  attempts an overwrite, asserts the error response and that the
  on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
  contract that a first write to a target still succeeds.

* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard

Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
2026-04-22 16:47:01 -07:00
Chris Lu
cb882ced46 fix(test): retry ENOENT in fcntl lock subprocess helper
TestPosixFileLocking/FcntlReleaseOnClose was flaky because the
subprocess spawned by startLockHolder occasionally saw ENOENT when
opening a file the parent had just created on the FUSE mount. Retry on
ENOENT (matching the existing openWithRetry pattern used in
testConcurrentLockContention) so the subprocess waits for the mount's
dentry state to propagate before reporting the lock acquire as failed.
2026-04-22 10:33:21 -07:00
Chris Lu
c4e1885053 fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug

The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.

Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.

* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement

Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).

Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
  VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
  shard; sendShardFileToDestination forwards that disk id. Metadata
  files (ecx/ecj/vif) inherit the disk of the first data shard
  targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
  validation; disk_id=0 (unset) falls back to the same
  auto-selection pattern as VolumeEcShardsCopy (prefer disk that
  already has shards for this volume, then any HDD with free space,
  then any location with free space).

Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
  disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
  disk_id rejection path.

* fix(volume-rust): honor disk_id in ReceiveFile for EC shards

Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).

* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow

On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.

* test(ec): fail invalid-disk_id test on transport error

Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.

* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test

Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.

* fix(test): preserve baseDir/volume path for single-dir clusters

StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
2026-04-22 10:30:13 -07:00
dependabot[bot]
1220468a33 build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 in /test/kafka (#9189)
build(deps): bump github.com/rclone/rclone in /test/kafka

Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5.
- [Release notes](https://github.com/rclone/rclone/releases)
- [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md)
- [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5)

---
updated-dependencies:
- dependency-name: github.com/rclone/rclone
  dependency-version: 1.73.5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-22 09:27:30 -07:00
Chris Lu
be9996962d fix(test): avoid port collision between master gRPC and volume ports
AllocateMiniPorts(1) reserved masterPort and masterPort+GrpcPortOffset
by holding listeners open, but closed them on return. The subsequent
AllocatePorts call bound 127.0.0.1:0, so the OS could immediately reuse
the just-released mini gRPC port as a volume port — causing the volume
server to fail at bind time with "address already in use".

Introduce AllocatePortSet(miniCount, regularCount) that holds every
listener open until the full set is chosen, and route the five volume
test cluster builders through it.
2026-04-21 23:33:57 -07:00
Chris Lu
96e5fea08e test(catalog_spark): bound weed shell invocation with 30s timeout
createTableBucket ran `weed shell` via exec.Command with no deadline.
When the shell's first command retries on a transient master connection
blip, the trailing `exit` on stdin never gets processed and the
subprocess blocks until the outer 20m `go test` timeout fires — the
surfacing symptom is a flaky 20m panic with no diagnostic output.

Wrap the invocation in exec.CommandContext with a 30s timeout, matching
the existing pattern in test/s3tables/catalog_risingwave/setup_test.go.
2026-04-21 23:27:10 -07:00
Chris Lu
7f67995c24 chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was
gated behind -mount.p2p (default true). Idle cost is negligible — a
tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the
surface area.

Removes the flag from weed filer, weed server (-filer.mount.p2p), and
weed mini, and always constructs the registry in NewFilerServer. Also
drops the now-dead nil guards in MountRegister/MountList/sweeper and
the TestMountRegister_DisabledIsNoOp case.
2026-04-21 23:00:11 -07:00
Chris Lu
9ae905e456 feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin

S3 and filer already use a refreshing pemfile provider for their HTTPS
cert, so rotated certificates (e.g. from k8s cert-manager) are picked up
without a restart. Master, volume, webdav, and admin, however, passed
cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at
startup — rotating those certs required a pod restart.

Add a small helper NewReloadingServerCertificate in weed/security that
wraps pemfile.Provider and returns a tls.Config.GetCertificate closure,
then wire it into the four remaining HTTPS entry points. httpdown now
also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates
but CertFile/KeyFile are empty, so volume server can pre-populate
TLSConfig.

A unit test exercises the rotation path (write cert, rotate on disk,
assert the callback returns the new cert) with a short refresh window.

* refactor(security): route filer/s3 HTTPS through the shared cert reloader

Before: filer.go and s3.go each kept a *certprovider.Provider on the
options struct plus a duplicated GetCertificateWithUpdate method. Both
were loading pemfile themselves. Behaviorally they already reloaded, but
the logic was duplicated two ways and neither path was shared with the
newly-added master/volume/webdav/admin wiring.

After: both use security.NewReloadingServerCertificate like the other
servers. The per-struct certProvider field and GetCertificateWithUpdate
method are removed, along with the now-unused certprovider and pemfile
imports. Net: -32 lines, one code path for all HTTPS cert reloading.

No behavior change — the refresh window, cache, and handshake contract
are identical (the helper wraps the same pemfile.NewProvider).

* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc

The HTTP client in weed/util/http/client loaded the mTLS client cert
once at startup via tls.LoadX509KeyPair. That left every long-lived
HTTPS client process (weed mount, backup, filer.copy, filer→volume,
s3→filer/volume) unable to pick up a rotated client cert without a
restart — even though the same cert-manager setup was already rotating
the server side fine.

Swap the client cert loader for a tls.Config.GetClientCertificate
callback backed by the same refreshing pemfile provider. New TLS
handshakes pick up the rotated cert; in-flight pooled connections keep
their old cert and drop as normal transport churn happens.

To keep this reusable from both server and client TLS code without an
import cycle (weed/security already imports weed/util/http/client for
LoadHTTPClientFromFile), extract the pemfile wrapper into a new
weed/security/certreload subpackage. weed/security keeps its thin
NewReloadingServerCertificate wrapper. The existing unit test moves
with the implementation.

gRPC mTLS was already handled by security.LoadServerTLS /
LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ
agent, Kafka gateway, and FUSE mount control plane are gRPC-only and
therefore already rotate.

CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted
as a known limitation in the wiki.

* fix(security): address PR review feedback on cert reloader

Bots (gemini-code-assist + coderabbit) flagged three real issues and a
couple of nits. Addressing them here:

1. KeyMaterial used context.Background(). The grpc pemfile provider's
   KeyMaterial blocks until material arrives or the context deadline
   expires; with Background() a slow disk could hang the TLS handshake
   indefinitely. Switched both the server and client callbacks to use
   hello.Context() / cri.Context() so a stuck read is bounded by the
   handshake timeout.

2. Admin server loaded TLS inside the serve goroutine. If the cert was
   bad, the goroutine returned but startAdminServer kept blocking on
   <-ctx.Done() with no listener, making the process look healthy with
   nothing bound. Moved TLS setup to run before the goroutine starts
   and propagate errors via fmt.Errorf; also captures the provider and
   defers Close().

3. HTTP client discarded the certprovider.Provider from
   NewClientGetCertificate. That leaked the refresh goroutine, and
   NewHttpClientWithTLS had a worse case where a CA-file failure after
   provider creation orphaned the provider entirely. Added a
   certProvider field and a Close() method on HTTPClient, and made
   the constructors close the provider on subsequent error paths.

4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain
   the provider. filer and webdav run ServeTLS synchronously, so a
   plain defer works. master/volume/s3 dispatch goroutines and return
   while the server keeps running, so they hook Close() into
   grace.OnInterrupt.

5. Test: certreload_test now tolerates transient read/parse errors
   during file rotation (writeSelfSigned rewrites cert before key) and
   reports the last error only if the deadline expires.

No user-visible behavior change for the happy path.

* test(tls): add end-to-end HTTPS cert rotation integration test

Boots a real `weed master` with HTTPS enabled, captures the leaf cert
served at TLS handshake time, atomically rewrites the cert/key files
on disk (the same rename-in-place pattern kubelet does when it swaps
a cert-manager Secret), and asserts that a subsequent TLS handshake
observes the rotated leaf — with no process restart, no SIGHUP, no
reloader sidecar. Verifies the full path: on-disk change → pemfile
refresh tick → provider.KeyMaterial → tls.Config.GetCertificate →
server TLS handshake.

Runtime is ~1s by exposing the reloader's refresh window as an env
var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the
test. The same env var is user-facing — documented in the wiki — so
operators running short-lived certs (Vault, cert-manager with
duration: 24h, etc.) can tighten the rotation-pickup window without a
rebuild. Defaults to 5h to preserve prior behavior.

security.CredRefreshingInterval is kept for API compatibility but now
aliases certreload.DefaultRefreshInterval so the same env controls
both gRPC mTLS and HTTPS reload.

* ci(tls): wire the TLS rotation integration test into GitHub Actions

Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner,
Go 1.25, build weed, run `go test` in test/tls_rotation, upload master
logs on failure. 10-minute job timeout; the test itself finishes in
about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms
inside the test.

Runs on every push to master and on every PR to master.

* fix(tls): address follow-up PR review comments

Three new comments on the integration test + volume shutdown path:

1. Test: peekServerCert was swallowing every dial/handshake error,
   which meant waitForCert's "last err: <nil>" fatal message lost all
   diagnostic value. Thread errors back through: peekServerCert now
   returns (*x509.Certificate, error), and waitForCert records the
   latest error so a CI flake points at the actual cause (master
   didn't come up, handshake rejected, CA pool mismatch, etc.).

2. Test: set HOME=<tempdir> on the master subprocess. Viper today
   registers the literal path "$HOME/.seaweedfs" without env
   expansion, so a developer's ~/.seaweedfs/security.toml is
   accidentally invisible — the test was relying on that. Pinning
   HOME is belt-and-braces against a future viper upgrade that does
   expand env vars.

3. volume.go: startClusterHttpService's provider close was registered
   via grace.OnInterrupt, which fires on SIGTERM but NOT on the
   v.shutdownCtx.Done() path used by mini / integration tests. The
   pemfile refresh goroutine leaked in that shutdown path. Now the
   helper returns a close func and the caller invokes it on BOTH
   shutdown paths for parity.

Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the
ast-grep static-analysis nit — zero-risk since the pool only trusts
our in-memory CA.

Test runs clean 3/3.
2026-04-21 20:20:11 -07:00
Chris Lu
e77f8ae204 fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157)

The STS GetFederationToken handler was implemented but never reachable.
Three routing gaps sent requests to the S3/IAM path instead of STS:

- No explicit mux route for Action=GetFederationToken in the URL query
- iamMatcher did not exclude GetFederationToken, so authenticated POSTs
  with Action in the form body were matched and dispatched to IAM
- UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity
  to STS, leaving GetFederationToken to fall through to DoActions and
  return NotImplemented

Add the missing route, the matcher exclusion, and the dispatch branch.

Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount
into the s3-iam-tests workflow as a new "sts" matrix entry. Before this
change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four
test functions ran in CI, which is why this regression shipped.

* test(iam): make orphaned STS/service-account tests pass under auth-enabled CI

Follow-up to wiring STS tests into CI: fixes several pre-existing issues
that made the newly-included tests fail locally.

Server fixes:
- weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when
  the caller is a legacy S3-config identity (not in the IAM user store).
  Previously any GetPoliciesForUser error short-circuited to InternalError,
  which hard-failed every SigV4 caller using keys from -s3.config.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs
  in the sa:<parent>:<uuid> format required by
  credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed
  the persistence-layer regex and caused every CreateServiceAccount call
  to return 500 once a filer-backed credential store validated the ID.

Test helpers:
- test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer
  sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from
  req.URL.Host, and a manual Host header made the signer emit host;host in
  SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn
  subtest to match the existing SeaweedFS behavior (user-context
  assumption).
- test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs
  requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned
  IAM writes otherwise fall through to the STS fallback and return
  InvalidAction.

CI matrix:
- .github/workflows/s3-iam-tests.yml: skip
  TestServiceAccountLifecycle/use_service_account_credentials only. The
  rest of the service-account suite passes; that one subtest depends on a
  separate credential-reload issue where new ABIA keys briefly register
  into accessKeyIdent but aren't persisted to the filer, so they vanish
  on the next reload. Out of scope for the #9157 GetFederationToken fix.

* fix(credential): accept AWS IAM username chars in service-account IDs

Gemini review on #9167 pointed out that ServiceAccountIdPattern's
parent-user segment was more restrictive than an AWS IAM username:
`[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with
`@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail
validation at the filer store even though the embedded IAM API
happily created them.

Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec
at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html)
and add a table-driven test that locks the expansion in.

* address PR review feedback on #9167

All five review items were valid; changes keyed to review bullets:

- weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows
  arbitrary policy-lookup failures. Only credential.ErrUserNotFound is
  treated leniently (the legacy-config SigV4 path); any other error now
  returns InternalError so we don't mint tokens with an incomplete
  policy set.
- weed/credential/grpc/grpc_identity.go: GetUser translates gRPC
  NotFound back to credential.ErrUserNotFound so errors.Is(...) above
  matches for gRPC-backed stores, not just memory/filer-direct.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates
  the generated saId against credential.ValidateServiceAccountId before
  returning. Surfaces a client 400 with the offending ID instead of the
  opaque 500 that used to bubble up from the persistence layer.
- weed/s3api/s3api_server_routing_test.go: seed a routing-test identity
  with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody
  with aws-sdk-go v4.Signer so the request actually passes
  AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers
  with no stsService) instead of just NotEqual(501) — 503 proves the
  dispatch reached STSHandlers.HandleSTSRequest.
- test/s3/iam/s3_service_account_test.go: callIAMAPI signs with
  service="iam" instead of "s3" (SeaweedFS verifies against whichever
  service the client signed with, but "iam" is semantically correct).
- weed/credential/validation_test.go: add positive rows for an
  uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID
  suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000).
2026-04-20 19:33:22 -07:00
Chris Lu
86c5e815d2 fix(kafka): make consumer-group rebalancing work end-to-end (#9143)
* fix(kafka): make consumer-group rebalancing work end-to-end

TestConsumerGroups was failing every run since the job was added
(2026-04-17) but the failures were masked by a `|| echo ...` trailer on
the go test invocation, so the CI reported green. Removing the mask
exposes several real bugs in the gateway's group-coordinator code:

1. JoinGroup deduplicated members by ClientID, which collapsed two
   Sarama consumers that share the default ClientID ("sarama") into a
   single member slot and broke rebalancing. Key dedup off the TCP
   ConnectionID instead; keep ClientID on the member for DescribeGroup
   fidelity.

2. Every JoinGroup replaced the *GroupMember struct, wiping the
   Assignment the leader had just published in its SyncGroup and leaving
   non-leader consumers with 0 partitions after a rebalance. Update the
   existing member in place on rejoin.

3. Non-leader SyncGroup returned an empty assignment while the leader
   was mid-rebalance, so consumers silently came up with no partitions.
   Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama
   retries the join/sync cycle (4 retries x 2s backoff by default).

4. Heartbeat returned ILLEGAL_GENERATION on a gen mismatch even when
   the group was in PreparingRebalance/CompletingRebalance. Return
   REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly
   cancels the session instead of tearing it down on a fatal error.

5. LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3
   (Members array) by default, so the gateway silently rejected the
   request as InvalidGroupID and dead consumers stayed in the group as
   phantom leaders. Added v3 (Members array) and v4+ (flexible/compact/
   tagged-fields) parsing.

The rebalancing integration tests called Consume() once per consumer,
which cannot survive a rebalance (heartbeat RBIP cancels the session
and Consume() returns - this is documented Sarama behaviour; callers
are expected to loop). Added a runConsumeLoop helper and used it in the
four affected sub-tests. RebalanceTestHandler.Setup now overwrites
stale entries in its assignments channel so the test observes the
settled post-rebalance snapshot rather than whatever arrived first.

* fix(kafka): address PR review feedback

- JoinGroup now snapshots existing members before mutating and restores
  the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the
  rollback path always deleted the entry, corrupting group state when
  an existing member rejoined with an incompatible protocol.

- handleLeaveGroup iterates request.Members instead of processing only
  the first entry, so v3+ batch departures (KIP-345 style) correctly
  remove every listed member and build a per-member response. A single
  group-state transition runs after the loop, with leader election
  only triggered if the actual group leader was among the departures.

- Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser
  already decoded flexible versions, but the response still went out in
  non-flexible encoding (4-byte array lengths, 2-byte strings, no
  tagged fields), which v4+ clients could not parse. Route flexible
  versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse.

- BasicFunctionality gives each consumer its own
  ConsumerGroupHandler/ready channel. The previous shared handler
  closed ready once, so readyCount advanced to numConsumers from a
  single signal; the test could proceed without the other consumers
  actually reaching Setup.

- RebalanceTestHandler.assignments is now a size-1 channel, so readers
  always observe the most recent rebalance snapshot instead of an
  intermediate one from an earlier round.
2026-04-20 10:11:45 -07:00
Chris Lu
8857dbfb74 fix(test): drop host port mapping from risingwave catalog test to kill TOCTOU flake
The random host port allocated by MustFreeMiniPorts was released before
docker run bound it, occasionally losing the race to another process and
failing with "address already in use". The sidecar already reaches
RisingWave via shared netns (--network container:...), so the host -p
mapping and the corresponding WaitForPort check were unused.
2026-04-19 23:26:45 -07:00
Chris Lu
a8ba9d106e peer chunk sharing 7/8: tryPeerRead read-path hook (#9136)
* mount: batched announcer + pooled peer conns for mount-to-mount RPCs

* peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that
  groups fids by HRW owner, fans out one ChunkAnnounce per owner in
  parallel. announcedAt is pruned at 2× TTL so it stays bounded.

* peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer
  address; the announcer and (next PR) the fetcher share it so
  steady-state owner RPCs skip the handshake cost entirely. Bounded
  at 4096 cached entries; shutdown conns are transparently replaced.

* WFS starts both alongside the gRPC server; stops them on unmount.

* mount: wire tryPeerRead via FetchChunk streaming gRPC

Replaces the HTTP GET byte-transfer path with a gRPC server-stream
FetchChunk call. Same fall-through semantics: any failure drops
through to entryChunkGroup.ReadDataAt, so reads never slow below
status quo.

* peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk
  (flattening manifests), asks the HRW owner for holders via
  ChunkLookup, then opens FetchChunk on each holder in LRU order
  (PR #5) until one succeeds. Assembled bytes are verified against
  FileChunk.ETag end-to-end — the peer is still treated as
  untrusted. Reuses the shared PeerConnPool from PR #6 for all
  outbound gRPC.

* peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing
  itself on a self-owned fid.

* filehandle_read.go: tryPeerRead slot between tryRDMARead and
  entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the
  presence of peerGrpcServer (the single identity test).

Read ordering with the feature enabled is now:
   local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server

One port, one identity, one connection pool — no more HTTP bytecast.

* test(fuse_p2p): end-to-end CI test for peer chunk sharing

Adds a FUSE-backed integration test that proves mount B can satisfy a
read from mount A's chunk cache instead of the volume tier.

Layout (modelled on test/fuse_dlm):

  test/fuse_p2p/framework_test.go        — cluster harness (1 master,
                                           1 volume, 1 filer, N mounts,
                                           all with -peer.enable)
  test/fuse_p2p/peer_chunk_sharing_test.go
                                         — writer-reader scenario

The test (TestPeerChunkSharing_ReadersPullFromPeerCache):

  1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner
     of a chunk is self ~50 % of the time (peer path short-circuits);
     with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly
     exercises the remote-owner fan-out.
  2. Mount 0 writes a ~8 MiB file, then reads it back through its own
     FUSE to warm its chunk cache.
  3. Waits for seed convergence (one full MountList refresh) plus an
     announcer flush cycle, so chunk-holder entries have reached each
     HRW owner.
  4. Mount 1 reads the same file.
  5. Verifies byte-for-byte equality AND greps mount 1's log for
     "peer read successful" — content matching alone is not proof
     (the volume fallback would also succeed), so the log marker is
     what distinguishes p2p from fallback.

Workflow .github/workflows/fuse-p2p-integration.yml triggers on any
change to mount/filer peer code, the p2p protos, or the test itself.
Failure artifacts (server + mount logs) are uploaded for 3 days.

Mounts run with -v=4 so the tryPeerRead success/failure glog messages
land in the log file the test greps.
2026-04-19 00:53:12 -07:00
Chris Lu
6787a4b4e8 fix kafka gateway and consumer group e2e flakes (#9129)
* fix(test): reduce kafka gateway and consumer group flakes

* fix(kafka): make broker health-check backoff respect context

Replace time.Sleep in the retry loop with a select on bc.ctx.Done() and
time.After so the backoff is interruptible during shutdown, per review
feedback on PR #9129.

* fix(kafka): guard broker HealthCheck against nil client

Return the same "broker client not connected" error used by the other
exported BrokerClient methods instead of panicking on a partially
initialized client, per CodeRabbit review feedback on PR #9129.
2026-04-18 10:03:24 -07:00
Chris Lu
32cbed9658 fix(fuse-tests): avoid ephemeral port reuse racing weed mini bind
freePort allocated in [20000, 55535], which overlaps the Linux ephemeral
range (32768-60999). The kernel could reuse the chosen port for an
outbound connection between the test closing its listener and weed mini
re-checking availability, causing:

  Port allocation failed: port N for Filer (specified by flag
  filer.port) is not available on 0.0.0.0 and cannot be used

Narrow the range to [20000, 32000] to stay below the ephemeral floor,
and pass -ip.bind=127.0.0.1 so mini's pre-check runs on the same IP the
test actually reserved the port on.
2026-04-17 13:47:17 -07:00
Chris Lu
f720f559cb ci(kafka-loadtest): switch off Ubuntu/Debian base images to avoid apt mirror flakes (#9119)
* ci(kafka-loadtest): retry apt-get to survive Ubuntu mirror flakes

The Kafka Quick Test workflow's Docker build of Dockerfile.loadtest
keeps hitting "Connection failed [IP: ...]" on archive.ubuntu.com /
security.ubuntu.com mid-build, e.g.:

    failed to solve: process "/bin/sh -c apt-get update && \
      apt-get install -y ca-certificates curl jq bash netcat ..."
      did not complete successfully: exit code: 100

Same class of failure PR #9106 fixed for pjdfstest. Apply the same
two retry knobs to Dockerfile.loadtest and Dockerfile.seektest so a
transient mirror flake retries five times with a 30s timeout instead
of failing the whole workflow.

(The fuller pjdfstest fix also restructured the build to use
docker/build-push-action with type=gha cache. Doing that for
kafka-client-loadtest would mean rewriting the make/docker-compose
build path; defer until the apt-retry alone proves insufficient.)

* ci(kafka-loadtest): also drop recommends/suggests + apt-get clean

Address PR review (gemini-code-assist): fully align with PR #9106's
pattern by adding --no-install-recommends / --no-install-suggests so
the runtime images stay small and don't pull in extra packages, plus
apt-get clean before rm -rf /var/lib/apt/lists/* in Dockerfile.seektest.

* ci(kafka-loadtest): use alpine / maven base images instead of apt

The previous rounds of apt-retry / apt-clean knobs aren't enough:
the Ubuntu mirror is persistently unreachable from the GitHub runner
for minutes at a time, which blows past the 5-retry / 30-second
Acquire configuration and still kills the build (see run 24551809614).

Switch both runtime images so no apt fetch is needed at all:

- Dockerfile.loadtest now runs on alpine:3.20. All runtime deps
  (ca-certificates, curl, jq, bash, netcat-openbsd) are in the
  Alpine main repo, fetched from Alpine's CDN rather than the
  Ubuntu archive that keeps going dark.
- Dockerfile.seektest now uses maven:3.9-eclipse-temurin-11, which
  ships JDK 11 and Maven preinstalled — no apt-get maven step.

This also means the runtime images no longer care about
Acquire::Retries / DEBIAN_FRONTEND / apt-get clean, so those lines
are removed with the apt call they were configuring.
2026-04-17 07:43:19 -07:00
Chris Lu
9554e259dd fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint (#9109)
* fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint

DuckDB ATTACH 's3://<bucket>/' AS cat (TYPE 'ICEBERG', ...) was failing
with "schema does not exist" because GET /v1/config ignored the warehouse
query parameter and returned no overrides.prefix, so subsequent requests
fell through to the hard-coded "warehouse" default bucket instead of the
one the client attached. LoadTable also returned an empty config, forcing
clients to discover the S3 endpoint out-of-band and producing 403s on
direct iceberg_scan calls.

- handleConfig now echoes overrides.prefix = bucket and defaults.warehouse
  when ?warehouse=s3://<bucket>/ is supplied.
- getBucketFromPrefix honors a warehouse query parameter as a fallback for
  clients that skip the /v1/config handshake.
- LoadTable responses advertise s3.endpoint and s3.path-style-access so
  clients can reach data files without separate configuration.

Refs #9103

* address review feedback on iceberg S3 endpoint vending

- deriveS3AdvertisedEndpoint is now a method on S3Options; honors
  externalUrl / S3_EXTERNAL_URL, switches to https when -s3.key.file is
  set, uses the https port when configured, and brackets IPv6 literals
  via util.JoinHostPort.
- handleCreateTable returns s.buildFileIOConfig() in both its staged and
  final LoadTableResult branches so create and load flows see the same
  FileIO hints.
- Add unit test coverage for the endpoint derivation scenarios.

* address CI and review feedback for #9109

- DuckDB integration test now runs under its own newOAuthTestEnv (with a
  valid IAM config) so the OAuth2 client_credentials flow DuckDB requires
  actually works; the shared env has no registered credentials, which was
  the cause of the CI failure. Helper createTableWithToken was added to
  create tables via Bearer auth.
- Tighten TestIssue9103_LoadTableDoesNotVendS3FileIOCredentials to also
  assert s3.path-style-access = "true", so a partial regression where the
  endpoint is vended but path-style is dropped still fails.
- deriveS3AdvertisedEndpoint now logs a startup hint when it infers the
  host from os.Hostname because the bind IP is a wildcard, pointing
  operators at -s3.externalUrl / S3_EXTERNAL_URL for reverse-proxy
  deployments where the inferred name is not externally reachable.
- handleConfig has a comment explaining that any sub-path in the
  warehouse URL is dropped because catalog routing is bucket-scoped.

* fix(iceberg): make advertised S3 endpoint strictly opt-in; add region

The wildcard-bind fallback to os.Hostname() in deriveS3AdvertisedEndpoint
was hijacking correctly-configured clients: on the CI runner it produced
http://runnervmrc6n4:<port>, which Spark (running in Docker) could not
resolve, so Spark iceberg tests began failing after the endpoint started
being vended in LoadTable responses.

Change the rule so advertising is opt-in and never guesses a host that
might not be routable:
  - -s3.externalUrl / S3_EXTERNAL_URL wins (covers reverse-proxy).
  - Otherwise, only an explicit, non-wildcard -s3.bindIp is used.
  - Wildcard / empty bind returns "" so no FileIO endpoint is vended and
    existing clients keep using their own configuration.

buildFileIOConfig additionally vends s3.region (defaulting to the same
value baked into table bucket ARNs) whenever it vends an endpoint, so
DuckDB's attach does not fail with "No region was provided via the
vended credentials" when the operator has opted in.

The DuckDB issue-9103 integration test runs under an env with a
wildcard bind, so it explicitly sets AWS_REGION in the docker run to
pick up the same default. The HTTP-level LoadTable-vending test was
dropped because its expectation is now conditional and already covered
by unit tests in iceberg_issue_9103_test.go.
2026-04-16 15:51:43 -07:00
Chris Lu
40ffc73aa8 ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106)
* ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes

Replace the local buildx cache + manual fallback with docker/setup-buildx-action
and docker/build-push-action using type=gha cache. The e2e and pjdfstest Dockerfile
layers now persist across runs in GitHub's own cache backend, so apt-get update
only hits Ubuntu mirrors when the Dockerfiles change. Also add Acquire::Retries
and Timeout so first-run cache-miss builds survive transient mirror sync errors.

* ci(pjdfstest): use local registry to share e2e image across buildx builds

The docker-container buildx driver cannot see images loaded into the host
Docker daemon, so the second build's FROM chrislusf/seaweedfs:e2e failed with
"not found" on registry-1.docker.io.

Run a local registry:2 on the runner, push both images to localhost:5000,
remap the FROM via build-contexts so the Dockerfile stays unchanged, then
tag the pulled images locally for docker compose to consume.
2026-04-16 12:50:19 -07:00
Chris Lu
ceae01d05b test(kafka): retry ConsumeWithGroup on failed initial join (#9105)
The consumer group resumption flow in TestOffsetManagement occasionally fails
with `read tcp ... i/o timeout` on the second consumer's first FetchMessage.
Re-joining an existing group races with the previous member's LeaveGroup and
session cleanup; the new reader can observe transient coordinator state that
the kafka-go client surfaces as a connection read timeout.

Wrap ConsumeWithGroup in a 3-attempt retry that rebuilds the reader only when
no message was received yet, so a transient join failure doesn't fail the
whole test. Partial progress (at least one message consumed) still returns
immediately, preserving the original error semantics for post-join failures.
2026-04-16 11:31:26 -07:00
dependabot[bot]
2c460e0fc9 build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest (#9082)
build(deps): bump org.apache.kafka:kafka-clients

Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2.

---
updated-dependencies:
- dependency-name: org.apache.kafka:kafka-clients
  dependency-version: 3.9.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-14 21:54:21 -07:00
Chris Lu
d0a09ea178 fix(s3): honor ChecksumAlgorithm on presigned URL uploads (#9076)
* fix(s3): honor ChecksumAlgorithm on presigned URL uploads

AWS SDK presigners hoist x-amz-sdk-checksum-algorithm (and related
checksum headers) into the signed URL's query string, so servers must
read either location. detectRequestedChecksumAlgorithm only looked at
request headers, so presigned PUTs with ChecksumAlgorithm set validated
and stored no additional checksum, and HEAD/GET never returned the
x-amz-checksum-* header.

Read these parameters from headers first, then fall back to a
case-insensitive query-string lookup. Apply the same fallback when
comparing an object-level checksum value against the computed one.

Fixes #9075

* test(s3): presigned URL checksum integration tests (#9075)

Adds test/s3/checksum with end-to-end coverage for flexible-checksum
behavior on presigned URL uploads. Tests generate a presigned PUT URL
with ChecksumAlgorithm set, upload the body with a plain http.Client
(bypassing AWS SDK middleware so the server must honor the query-string
hoisted x-amz-sdk-checksum-algorithm), then HEAD/GET with
ChecksumMode=ENABLED and assert the stored x-amz-checksum-* header.

Covers SHA256, SHA1, and a negative control with no checksum requested.
Wires the new directory into s3-go-tests.yml as its own CI job.

* perf(s3): parse presigned query once in detectRequestedChecksumAlgorithm

Previously, each header fallback called getHeaderOrQuery, which re-parsed
r.URL.Query() and allocated a new map on every invocation — up to eight
times per PutObject request. Parse the raw query at most once per request
(only when non-empty) and pass the pre-parsed url.Values into a new
lookupHeaderOrQuery helper.

Also drops a redundant strings.ToLower allocation in the case-insensitive
query key scan (strings.EqualFold already handles ASCII case folding).

Addresses review feedback from gemini-code-assist on PR #9076.

* test(s3): honor credential env vars and add presigned upload timeout

- init() now reads S3_ACCESS_KEY/S3_SECRET_KEY (and AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY / AWS_REGION fallbacks) so that
  `make test-with-server ACCESS_KEY=... SECRET_KEY=...` no longer
  authenticates with hardcoded defaults while the server has been
  started with different credentials.
- uploadViaPresignedURL uses a dedicated http.Client with a 30s timeout
  instead of http.DefaultClient, so a stalled server fails fast in CI
  instead of blocking until the suite's global timeout fires.

Addresses review feedback from coderabbitai on PR #9076.

* test(s3): pass S3_PORT and credentials through to checksum tests

- 'make test' now exports S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY
  derived from the Makefile variables so the Go test process talks to
  the same endpoint/credentials that start-server was launched with.
- start-server cleans up the background SeaweedFS process and PID file
  when the readiness poll times out, preventing stale port conflicts on
  subsequent runs.

Addresses review feedback from coderabbitai on PR #9076.

* ci(s3): raise checksum tests step timeout

make test-with-server builds weed_binary, waits up to 90s for readiness,
then runs go test -timeout=10m. The previous 12-minute step timeout only
had ~2 minutes of headroom over the Go timeout, risking the Actions
runner killing the step before tests reported a real failure.

Bumps the job timeout from 15 to 20 minutes and the step timeout from
12 to 16 minutes, matching other S3 integration jobs.

Addresses review feedback from coderabbitai on PR #9076.

* perf(s3): thread pre-parsed query through putToFiler hot path

Parse the request's query string once at the top of putToFiler and
reuse the resulting url.Values for both the checksum-algorithm detection
and the expected-checksum verification. Previously, the verification
path called getHeaderOrQuery which re-parsed r.URL.Query() again on
every PutObject, defeating the previous commit's single-parse goal.

- Add parseRequestQuery + detectRequestedChecksumAlgorithmQ (the
  pre-parsed-query variant). detectRequestedChecksumAlgorithm is now a
  thin wrapper used by callers that do a single lookup.
- putToFiler parses once and threads the result through both call sites.
- Remove getHeaderOrQuery and update the unit test to use
  lookupHeaderOrQuery directly.

Addresses follow-up review from gemini-code-assist on PR #9076.

* test(s3): check io.ReadAll error in uploadViaPresignedURL helper

* test(s3): drop SHA1 presigned test case

The AWS SDK v2 presigner signs a Content-MD5 header at presign time
for SHA1 PutObject requests even when no body is attached (the MD5 of
the empty payload gets baked into the signed headers). Uploading the
real body via a plain http.Client then trips SeaweedFS's MD5 validation
and returns BadDigest — an SDK/presigner quirk, not a SeaweedFS bug.

The SHA256 positive case already exercises the server-side
query-hoisted algorithm path that issue #9075 is about, and the unit
tests in weed/s3api cover each algorithm's header mapping. Drop the
SHA1 integration case rather than chase SDK-specific workarounds.

* test(s3): provide real Content-MD5 to presigned checksum test

AWS SDK v2's flexible-checksum middleware signs a Content-MD5 header at
presign time. There is no body to hash at that point, so it seeds the
header with MD5 of the empty payload. When the real body is then PUT
with a plain http.Client, SeaweedFS's server-side Content-MD5
verification correctly rejects the upload with BadDigest.

Pre-compute the MD5 of the test body and thread it into
PutObjectInput.ContentMD5 so the signed Content-MD5 matches the body
that will actually be uploaded. The test still exercises the
server-side path that reads X-Amz-Sdk-Checksum-Algorithm from the
query string (the fix that PR #9076 is validating).

* test(s3): send the signed Content-MD5 header on presigned upload

uploadViaPresignedURL now accepts an extraHeaders map so callers can
thread through headers that the presigner signed but the raw http
request would otherwise omit. The SHA256 test passes the Content-MD5
it computed, matching what the presigner baked into the signature.

Fixes SignatureDoesNotMatch seen in CI after the previous commit set
ContentMD5 on the presign input without sending the corresponding
header on the actual upload.

* test(s3): build presigned URL with the raw v4 signer

The AWS SDK v2 s3.PresignClient runs the flexible-checksum middleware
for any PutObject input that carries ChecksumAlgorithm. That middleware
injects a Content-MD5 header at presign time, and with no body present
it seeds MD5-of-empty. Any subsequent upload of a non-empty body
through a plain http.Client then trips SeaweedFS's Content-MD5
verification and returns BadDigest — not the code path that issue
#9075 is about.

Replace the PresignClient usage in the integration test with a direct
call to v4.Signer.PresignHTTP, building a canonical URL whose query
string already contains x-amz-sdk-checksum-algorithm=SHA256. This is
exactly the shape of URL a browser/curl client would receive from any
presigner that hoists the algorithm header, and it exercises the
server-side fix from PR #9076 without dragging in SDK-specific
middleware quirks.

* test(s3): set X-Amz-Expires on presigned URL before signing

v4.Signer.PresignHTTP does not add X-Amz-Expires on its own — the
caller has to seed it into the request's query string so the signer
includes it in the canonical query and the server accepts the
presigned URL. Without it, SeaweedFS correctly returns
AuthorizationQueryParametersError.

Also adds a .gitignore for the make-managed test volume data, log
file, and PID file so local `make test-with-server` runs do not leave
artifacts tracked by git.

Verified by running the integration tests locally:
  make test-with-server → both presigned checksum tests PASS.
2026-04-14 21:52:49 -07:00
Chris Lu
08d9193fe1 [nfs] Add NFS (#9067)
* add filer inode foundation for nfs

* nfs command skeleton

* add filer inode index foundation for nfs

* make nfs inode index hardlink aware

* add nfs filehandle and inode lookup plumbing

* add read-only nfs frontend foundation

* add nfs namespace mutation support

* add chunk-backed nfs write path

* add nfs protocol integration tests

* add stale handle nfs coverage

* complete nfs hardlink and failover coverage

* add nfs export access controls

* add nfs metadata cache invalidation

* fix nfs chunk read lookup routing

* fix nfs review findings and rename regression

* address pr 9067 review comments

- filer_inode: fail fast if the snowflake sequencer cannot start, and let
  operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
  to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
  a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
  after CommitTransaction so a rolled-back rename does not strand live
  metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
  filer.path, so the experimental server does not expose the entire
  filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
  layout rather than RFC 1813 LINK3args

* mount: pre-allocate inode in Mkdir and Symlink

Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.

createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.

* sequence: mask snowflake node id on int→uint32 conversion

CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.

* add test/nfs integration suite

Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.

Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.

Framework notes:

- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
  -port.grpc explicitly so the default port+10000 convention cannot
  overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
  starting the NFS server — the NFS server's ensureIndexedEntry check
  requires the export root to exist with a real entry, which filer.Root
  does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
  not try to re-dial via portmapper (which concatenates ":111" onto the
  address).

* ci: add NFS integration test workflow

Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.

* nfs: use append for buffer growth in Write and Truncate

The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.

* filer: honor caller cancellation in collectInodeIndexEntries

Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.

* nfs: skip filer read on open when O_TRUNC is set

openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.

* nfs: allow Seek on O_APPEND files and document buffered write cap

Two related cleanups on filesystem.go:

- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
  Seek error ("append-only file descriptors may only seek to EOF")
  prevented read-and-write workloads that legitimately reposition the
  read cursor. Write already snaps the offset to EOF before persisting
  (see seaweedFile Write), so Seek can unconditionally accept any
  offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
  per-file ceiling, the memory footprint it implies, and that the real
  fix for larger whole-file rewrites is streaming / multi-chunk support.

Both changes flagged by gemini-code-assist on PR 9067.

* nfs: guard offset before casting to int in Write

CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.

* nfs: compute Write endOffset in int64 to satisfy CodeQL

The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.

* ci: drop emojis from nfs-tests workflow summary

Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.

* nfs: annotate remaining DEV_PLAN TODOs with status

Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:

- Reuse chunk cache and mutation stream helpers without FUSE deps:
  checked off — the NFS server imports weed/filer.ReaderCache and
  weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
  as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
  annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
  go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
  "Current Blockers" note.

* test/nfs: share port+readiness helpers with test/testutil

Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.

Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.

* nfs: stream writes to volume servers instead of buffering the whole file

Before this change the NFS write path held the full contents of every
writable open in memory:

  - OpenFile(write) called loadWritableContent which read the existing
    file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
  - each Write() extended content in-place
  - Close() uploaded the whole buffer as a single chunk via
    persistContent + AssignVolume

The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:

  - openFile(write) no longer loads the existing content at all; it
    only issues an UpdateEntry when O_TRUNC is set *and* the file is
    non-empty (so a fresh create+trunc is still zero-RPC)
  - Write() streams the caller's bytes straight to a volume server via
    one AssignVolume + one chunk upload, then atomically appends the
    resulting chunk to the filer entry through mutateEntry. Any
    previously inlined entry.Content is migrated to a chunk in the same
    update so the chunk list becomes the authoritative representation.
  - Truncate() becomes a direct mutateEntry (drop chunks past the new
    size, clip inline content, update FileSize) instead of resizing an
    in-memory buffer.
  - Close() is a no-op because everything was flushed inline.

The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.

Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).

* filer: extract shared gateway upload helper for NFS and WebDAV

Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.

- NFS's saveDataAsChunk is now a thin adapter that assembles the
  GatewayChunkUploadRequest from server options and calls the helper.
  The chunkUploader interface keeps working for test injection because
  the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
  local operation.NewUploader call plus the AssignVolume/UploadOption
  scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
  features that do not fit the shared helper (a pre-allocated file-id
  pool used to skip AssignVolume entirely, and a chunkCache
  write-through at offset 0 so future reads hit the mount's local
  cache), both of which are mount-specific.

Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.

* nfs: remove DESIGN.md and DEV_PLAN.md

The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
2026-04-14 20:48:24 -07:00
Chris Lu
2818251dd5 fix(iceberg): clean stale data before creating a table (#9074) (#9077)
* fix(iceberg): clean stale data before creating a table

CREATE TABLE AS from Trino fails against the SeaweedFS Iceberg REST
catalog with "Cannot create a table on a non-empty location". The
catalog assigns every new table the deterministic <ns>/<table> path,
and Trino's pre-write check rejects the CTAS whenever leftover objects
live there — typically files from a prior DROP that did not purge the
data, or an earlier aborted CTAS.

Make the catalog the authority for table existence: before writing any
metadata, look up the table in the S3 Tables catalog. If it is already
registered, return the existing definition (idempotent create). If it
is not registered, any objects still sitting at the target location
are stale, so purge them before proceeding. Live tables are never
touched — the cleanup path is guarded by the catalog lookup.

Fixes #9074

* test(trino): regression for create/drop/recreate table (#9074)

Exercises the exact sequence from the reported bug: CREATE TABLE without
an explicit location, INSERT, DROP, then CREATE again with the same name,
followed by a CTAS on top. Previously the recreate failed with
"Cannot create a table on a non-empty location" because stale data files
from the dropped table lingered at the deterministic <schema>/<table>
path. The test also asserts the recreated table does not see the dropped
data and that a drop/recreate CTAS cycle works.

* fix(iceberg): purge dropped table location on DROP TABLE

Recreate-at-same-location was failing with "Cannot create a table on a
non-empty location" because DROP TABLE only removed the catalog entry
and left data files behind. Trino's pre-write emptiness check then
rejected the subsequent CREATE.

Look up the table's storage location before deleting the catalog entry,
and after a successful DeleteTable, purge the filer subtree at that
location. The lookup uses the catalog — the authoritative owner of the
name→location mapping — so cleanup is gated on a live table having
existed; failed lookups skip cleanup and leave storage alone.

Also pin the regression test to an explicit, fixed location so the
recreate genuinely targets the same path the drop was supposed to free.

* refactor(iceberg): use errors.As in isNoSuchTableError

* fix(iceberg): drop create-preflight cleanup; harden cleanup path

- Remove the destructive cleanupStaleTableLocation call from the
  CreateTable preflight. Storage cleanup now lives exclusively on the
  DROP path, so CreateTable has no side effects on storage beyond the
  new table's own metadata write.
- Validate tablePath in cleanupStaleTableLocation by rejecting empty,
  ".", "..", or backslash-bearing segments before joining with the
  bucket prefix, so a crafted location cannot escape the bucket
  subtree. path.Clean would silently collapse traversal, so segments
  are checked raw.

* test(trino): defer table drops in recreate test for failure-safe cleanup
2026-04-14 20:34:53 -07:00
Chris Lu
7a7f220224 feat(mount): cap write buffer with -writeBufferSizeMB (#9066)
* feat(mount): cap write buffer with -writeBufferSizeMB

Without a bound on the per-mount write pipeline, sustained upload
failures (e.g. volume server returning "Volume Size Exceeded" while
the master hasn't yet rotated assignments) let sealed chunks pile up
across open file handles until the swap directory — by default
os.TempDir() — fills the disk. Reported on 4.19 filling /tmp to 1.8 TB
during a large rclone sync.

Add a global WriteBufferAccountant shared across every UploadPipeline
in a mount. Creating a new page chunk (memory or swap) first reserves
ChunkSize bytes; when the cap is reached the writer blocks until an
uploader finishes and releases, turning swap overflow into natural
FUSE-level backpressure instead of unbounded disk growth.

The new -writeBufferSizeMB flag (also accepted via fuse.conf) defaults
to 0 = unlimited, preserving current behavior. Reserve drops
chunksLock while blocking so uploader goroutines — which take
chunksLock on completion before calling Release — cannot deadlock,
and an oversized reservation on an empty accountant succeeds to avoid
single-handle starvation.

* fix(mount): plug write-budget leaks in pipeline Shutdown

Review on #9066 caught two accounting bugs on the Destroy() path:

1. Writable-chunk leak (high). SaveDataAt() reserves ChunkSize before
   inserting into writableChunks, but Shutdown() only iterated
   sealedChunks. Truncate / metadata-invalidation flows call Destroy()
   (via ResetDirtyPages) without flushing first, so any dirty but
   unsealed chunks would permanently shrink the global write budget.
   Shutdown now frees and releases writable chunks too.

2. Double release with racing uploader (medium). Shutdown called
   accountant.Release directly after FreeReference, while the async
   uploader goroutine did the same on normal completion — under a
   Destroy-before-flush race this could underflow the accountant and
   let later writes exceed the configured cap. Move accounting into
   SealedChunk.FreeReference itself: the refcount-zero transition is
   exactly-once by construction, so any number of FreeReference calls
   release the slot precisely once.

Add regression tests for the writable-leak and the FreeReference
idempotency guarantee.

* test(mount): remove sleep-based race in accountant blocking test

Address review nits on #9066:
- Replace time.Sleep(50ms) proxy for "goroutine entered Reserve" with
  a started channel the goroutine closes immediately before calling
  Reserve. Reserve cannot make progress until Release is called, so
  landed is guaranteed false after the handshake — no arbitrary wait.
- Short-circuit WriteBufferAccountant.Used() in unlimited mode for
  consistency with Reserve/Release, avoiding a mutex round-trip.

* test(mount): add end-to-end write-buffer cap integration test

Exercises the full write-budget plumbing with a small cap (4 chunks of
64 KiB = 256 KiB) shared across three UploadPipelines fed by six
concurrent writers. A gated saveFn models the "volume server rejecting
uploads" condition from the original report: no sealed chunk can drain
until the test opens the gate. A background sampler records the peak
value of accountant.Used() throughout the run.

The test asserts:
  - writers fill the budget and then block on Reserve (Used() stays at
    the cap while stalled)
  - Used() never exceeds the configured cap even under concurrent
    pressure from multiple pipelines
  - after the gate opens, writers drain to zero
  - peak observed Used() matches the cap (262144 bytes in this run)

While wiring this up, the race detector surfaced a pre-existing data
race on UploadPipeline.uploaderCount: the two glog.V(4) lines around
the atomic Add sites read the field non-atomically. Capture the new
value from AddInt32 and log that instead — one-liner each, no
behavioral change.

* test(fuse): end-to-end integration test for -writeBufferSizeMB

Exercise the new write-buffer cap against a real weed mount so CI
(fuse-integration.yml) covers the FUSE→upload-pipeline→filer path, not
just the in-package unit tests. Uses a 4 MiB cap with 2 MiB chunks so
every subtest's total write demand is multiples of the budget and
Reserve/Release must drive forward progress for writes to complete.

Subtests:
- ConcurrentLargeWrites: six parallel 6 MiB files (36 MiB total, ~18
  chunk allocations) through the same mount, verifies every byte
  round-trips.
- SingleFileExceedingCap: one 20 MiB file (10 chunks) through a single
  handle, catching any self-deadlock when the pipeline's own earlier
  chunks already fill the global budget.
- DoesNotDeadlockAfterPressure: final small write with a 30s timeout,
  catching budget-slot leaks that would otherwise hang subsequent
  writes on a still-full accountant.

Ran locally on Darwin with macfuse against a real weed mini + mount:
  === RUN   TestWriteBufferCap
  --- PASS: TestWriteBufferCap (1.82s)

* test(fuse): loosen write-buffer cap e2e test + fail-fast on hang

On Linux CI the previous configuration (-writeBufferSizeMB=4,
-concurrentWriters=4 against a 20 MiB single-handle write)
deterministically hung the "Run FUSE Integration Tests" step to the
45-minute workflow timeout, while on macOS / macfuse the same test
completes in ~2 seconds (see run 24386197483). The Linux hang shows
up after TestWriteBufferCap/ConcurrentLargeWrites completes cleanly,
then TestWriteBufferCap/SingleFileExceedingCap starts and never
emits its PASS line.

Change:
- Loosen the cap to 16 MiB (8 × 2 MiB chunk slots) and drop the
  custom -concurrentWriters override. The subtests still drive demand
  well above the cap (32 MiB concurrent, 12 MiB single-handle), so
  Reserve/Release is still on every chunk-allocation path; the cap
  just gives the pipeline enough headroom that interactions with the
  per-file writableChunkLimit and the go-fuse MaxWrite batching don't
  wedge a single-handle writer on a slow runner.
- Wrap every os.WriteFile in a writeWithTimeout helper that dumps every
  live goroutine on timeout. If this ever re-regresses, CI surfaces
  the actual stuck goroutines instead of a 45-minute walltime.
- Also guard the concurrent-writer goroutines with the same timeout +
  stack dump.

The in-package unit test TestWriteBufferCap_SharedAcrossPipelines
remains the deterministic, controlled verification of the blocking
Reserve/Release path — this e2e test is now a smoke test for
correctness and absence of deadlocks through a real FUSE mount, which
is all it should be.

* fix: address PR #9066 review — idempotent FreeReference, subtest watchdog, larger single-handle test

FreeReference on SealedChunk now early-returns when referenceCounter is
already <= 0. The existing == 0 body guard already made side effects
idempotent, but the counter itself would still decrement into the
negatives on a double-call — ugly and a latent landmine for any future
caller that does math on the counter. Make double-call a strict no-op.

test(fuse): per-subtest watchdog + larger single-handle test

- Add runSubtestWithWatchdog and wrap every TestWriteBufferCap subtest
  with a 3-minute deadline. Individual writes were already
  timeout-wrapped but the readback loops and surrounding bookkeeping
  were not, leaving a gap where a subtest body could still hang. On
  watchdog fire, every live goroutine is dumped so CI surfaces the
  wedge instead of a 45-minute walltime.

- Bump testLargeFileUnderCap from 12 MiB → 20 MiB (10 chunks) to
  exceed the 16 MiB cap (8 slots) again and actually exercise
  Reserve/Release backpressure on a single file handle. The earlier
  e2e hang was under much tighter params (-writeBufferSizeMB=4,
  -concurrentWriters=4, writable limit 4); with the current loosened
  config the pressure is gentle and the goroutine-dump-on-timeout
  safety net is in place if it ever regresses.

Declined: adding an observable peak-Used() assertion to the e2e test.
The mount runs as a subprocess so its in-process WriteBufferAccountant
state isn't reachable from the test without adding a metrics/RPC
surface. The deterministic peak-vs-cap verification already lives in
the in-package unit test TestWriteBufferCap_SharedAcrossPipelines.
Recorded this rationale inline in TestWriteBufferCap's doc comment.

* test(fuse): capture mount pprof goroutine dump on write-timeout

The previous run (24388549058) hung on LargeFileUnderCap and the
test-side dumpAllGoroutines only showed the test process — the test's
syscall.Write is blocked in the kernel waiting for FUSE to respond,
which tells us nothing about where the MOUNT is stuck. The mount runs
as a subprocess so its in-process stacks aren't reachable from the
test.

Enable the mount's pprof endpoint via -debug=true -debug.port=<free>,
allocate the port from the test, and on write-timeout fetch
/debug/pprof/goroutine?debug=2 from the mount process and log it. This
gives CI the only view that can actually diagnose a write-buffer
backpressure deadlock (writer goroutines blocked on Reserve, uploader
goroutines stalled on something, etc).

Kept fileSize at 20 MiB so the Linux CI run will still hit the hang
(if it's genuinely there) and produce an actionable mount-side dump;
the alternative — silently shrinking the test below the cap — would
lose the regression signal entirely.

* review: constructor-inject accountant + subtest watchdog body on main

Two PR-#9066 review fixes:

1. NewUploadPipeline now takes the WriteBufferAccountant as a
   constructor parameter; SetWriteBufferAccountant is removed. In
   practice the previous setter was only called once during
   newMemoryChunkPages, before any goroutine could touch the
   pipeline, so there was no actual race — but constructor injection
   makes the "accountant is fixed at construction time" invariant
   explicit and eliminates the possibility of a future caller
   mutating it mid-flight. All three call sites (real + two tests)
   updated; the legacy TestUploadPipeline passes a nil accountant,
   preserving backward-compatible unlimited-mode behavior.

2. runSubtestWithWatchdog now runs body on the subtest main goroutine
   and starts a watcher goroutine that only calls goroutine-safe t
   methods (t.Log, t.Logf, t.Errorf). The previous version ran body
   on a spawned goroutine, which meant any require.* or writeWithTimeout
   t.Fatalf inside body was being called from a non-test goroutine —
   explicitly disallowed by Go's testing docs. The watcher no longer
   interrupts body (it can't), so body must return on its own —
   which it does via writeWithTimeout's internal 90s timeout firing
   t.Fatalf on (now) the main goroutine. The watchdog still provides
   the critical diagnostic: on timeout it dumps both test-side and
   mount-side (via pprof) goroutine stacks and marks the test failed
   via t.Errorf.

* fix(mount): IsComplete must detect coverage across adjacent intervals

Linux FUSE caps per-op writes at FUSE_MAX_PAGES_PER_REQ (typically
1 MiB on x86_64) regardless of go-fuse's requested MaxWrite, so a
2 MiB chunk filled by a sequential writer arrives as two adjacent
1 MiB write ops. addInterval in ChunkWrittenIntervalList does not
merge adjacent intervals, so the resulting list has two elements
{[0,1M], [1M,2M]} — fully covered, but list.size()==2.

IsComplete previously returned `list.size() == 1 &&
list.head.next.isComplete(chunkSize)`, which required a single
interval covering [0, chunkSize). Under that rule, chunks filled by
adjacent writes never reach IsComplete==true, so maybeMoveToSealed
never fires, and the chunks sit in writableChunks until
FlushAll/close. SaveContent handles the adjacency correctly via its
inline merge loop, so uploads work once they're triggered — but
IsComplete is the gate that triggers them.

This was a latent bug: without the write-buffer cap, the overflow
path kicks in at writableChunkLimit (default 128) and force-seals
chunks, hiding the leak. #9066's -writeBufferSizeMB adds a tighter
global cap, and with 8 slots / 20 MiB test, the budget trips long
before overflow. The writer blocks in Reserve, waiting for a slot
that never frees because no uploader ever ran — observed in the CI
run 24390596623 mount pprof dump: goroutine 1 stuck in
WriteBufferAccountant.Reserve → cond.Wait, zero uploader goroutines
anywhere in the 89-goroutine dump.

Walk the (sorted) interval list tracking the furthest covered
offset; return true if coverage reaches chunkSize with no gaps. This
correctly handles adjacent intervals, overlapping intervals, and
out-of-order inserts. Added TestIsComplete_AdjacentIntervals
covering single-write, two adjacent halves (both orderings), eight
adjacent eighths, gaps, missing edges, and overlaps.

* test(fuse): route mount glog to stderr + dump mount on any write error

Run 24392087737 (with the IsComplete fix) no longer hangs on Linux —
huge progress. Now TestWriteBufferCap/LargeFileUnderCap fails with
'close(...write_buffer_cap_large.bin): input/output error', meaning
a chunk upload failed and pages.lastErr propagated via FlushData to
close(). But the mount log in the CI artifact is empty because weed
mount's glog defaults to /tmp/weed.* files, which the CI upload step
never sees, so we can't tell WHICH upload failed or WHY.

Add -logtostderr=true -v=2 to MountOptions so glog output goes to
the mount process's stderr, which the framework's startProcess
redirects into f.logDir/mount.log, which the framework's DumpLogs
then prints to the test output on failure. The -v=2 floor enables
saveDataAsChunk upload errors (currently logged at V(0)) plus the
medium-level write_pipeline/upload traces without drowning the log
in V(4) noise.

Also dump MOUNT goroutines on any writeWithTimeout error (not just
timeout). The IsComplete fix means we now get explicit errors
instead of silent hangs, and the goroutine dump at the error moment
shows in-flight upload state (pending sealed chunks, retry loops,
etc) that a post-failure log alone can't capture.
2026-04-14 07:47:35 -07:00
Chris Lu
300e906330 admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes

The admin bucket size fix (#9058) left object counts at zero for
EC-encoded data because VolumeEcShardInformationMessage carried no file
count. Billing/monitoring dashboards therefore still under-report
objects once a bucket is EC-encoded.

Thread file_count and delete_count end-to-end:

- Add file_count/delete_count to VolumeEcShardInformationMessage (proto
  fields 8 and 9) and regenerate master_pb.
- Compute them lazily on volume servers by walking the .ecx index once
  per EcVolume, cache on the struct, and keep the cache in sync inside
  DeleteNeedleFromEcx (distinguishing live vs already-tombstoned
  entries so idempotent deletes do not drift the counts).
- Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage
  and carry them through the master-side EcVolumeInfo / topology sync.
- Aggregate in admin collectCollectionStats, deduping per volume id:
  every node holding shards of an EC volume reports the same counts, so
  summing across nodes would otherwise multiply the object count by the
  number of shard holders.

Regression tests cover the initial .ecx walk, live/tombstoned delete
bookkeeping (including idempotent and missing-key cases), and the admin
dedup path for an EC volume reported by multiple nodes.

* ec: include .ecj journal in EcVolume delete count

The initial delete count only reflected .ecx tombstones, missing any
needle that was journaled in .ecj but not yet folded into .ecx — e.g.
on partial recovery. Expand initCountsLocked to take the union of
.ecx tombstones and .ecj journal entries, deduped by needle id, so:

  - an id that is both tombstoned in .ecx and listed in .ecj counts once
  - a duplicate .ecj entry counts once
  - an .ecj id with a live .ecx entry is counted as deleted (not live)
  - an .ecj id with no matching .ecx entry is still counted

Covered by TestEcVolumeFileAndDeleteCountEcjUnion.

* ec: report delete count authoritatively and tombstone once per delete

Address two issues with the previous EcVolume file/delete count work:

1. The delete count was computed lazily on first heartbeat and mixed
   in a .ecj-union fallback to "recover" partial state. That diverged
   from how regular volumes report counts (always live from the needle
   map) and had drift cases when .ecj got reconciled. Replace with an
   eager walk of .ecx at NewEcVolume time, maintained incrementally on
   every DeleteNeedleFromEcx call. Semantics now match needle_map_metric:
   FileCount is the total number of needles ever recorded in .ecx
   (live + tombstoned), DeleteCount is the tombstones — so live =
   FileCount - DeleteCount. Drop the .ecj-union logic entirely.

2. A single EC needle delete fanned out to every node holding a replica
   of the primary data shard and called DeleteNeedleFromEcx on each,
   which inflated the per-volume delete total by the replica factor.
   Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in
   order and stop at the first success (one tombstone per delete), and
   only fall back to other shards when the primary shard has no home
   (ErrEcShardMissing sentinel), not on transient RPC errors.

Admin aggregation now folds EC counts correctly: FileCount is deduped
per volume id (every shard holder has an identical .ecx) and DeleteCount
is summed across nodes (each delete tombstones exactly one node). Live
object count = deduped FileCount - summed DeleteCount.

Tests updated to match the new semantics:
  - EC volume counts seed FileCount as total .ecx entries (live +
    tombstoned), DeleteCount as tombstones.
  - DeleteNeedleFromEcx keeps FileCount constant and increments
    DeleteCount only on live->tombstone transitions.
  - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2)
    to prove they're summed, while FileCount=100 is applied once.

* ec: test fixture uses real vid; admin warns on skewed ec counts

- writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the
  actual vid passed in, instead of hardcoding "_1". The existing tests
  all use vid=1 so behaviour is unchanged, but the helper no longer
  silently diverges from its documented parameter.
- collectCollectionStats logs a glog warning when an EC volume's summed
  delete count exceeds its deduped file count, surfacing the anomaly
  (stale heartbeat, counter drift, etc.) instead of silently dropping
  the volume from the object count.

* ec: derive file/delete counts from .ecx/.ecj file sizes

seedCountsFromEcx walked the full .ecx index at volume load, which is
wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj
has fixed-size deletion records (NeedleIdSize), so both counts are pure
file-size arithmetic.

  fileCount   = ecxFileSize / NeedleMapEntrySize
  deleteCount = ecjFileSize / NeedleIdSize

Rip out the cached counters, countsLock, seedCountsFromEcx, and the
recordDelete helper. Track ecjFileSize directly on the EcVolume struct,
seed it from Stat() at load, and bump it on every successful .ecj append
inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write
entirely when the needle is already tombstoned so the derived delete
count stays idempotent on repeat deletes. Heartbeats now compute counts
in O(1).

Tests updated: the initial fixture pre-populates .ecj with two ids to
verify the file-size derivation end-to-end, and the delete test keeps
its idempotent-re-delete / missing-needle invariants (unchanged
externally, now enforced by the early return rather than a cache guard).

* ec: sync Rust volume server with Go file/delete count semantics

Mirror the Go-side EC file/delete count work in the Rust volume server
so mixed Go/Rust clusters report consistent bucket object counts in
the admin dashboard.

- Add file_count (8) and delete_count (9) to the Rust copy of
  VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto).
- EcVolume gains ecj_file_size, seeded from the journal's metadata on
  open and bumped inside journal_delete on every successful append.
- file_and_delete_count() returns counts derived in O(1) from
  ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and
  ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount.
- to_volume_ec_shard_information_messages populates the new proto
  fields instead of defaulting them to zero.
- mark_needle_deleted_in_ecx now returns a DeleteOutcome enum
  (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip
  both the .ecj append and the size bump when the needle is missing
  or already tombstoned, keeping the derived delete_count idempotent
  on repeat or no-op deletes.
- Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's
  RebuildEcxFile is only called from specific decode/rebuild gRPC
  handlers, not on volume open, and replaying on load was hiding the
  deletion journal from the new file-size-derived delete counter.
  rebuild_ecx_from_journal is kept as dead_code for future decode
  paths that may want the same replay semantics.

Also clean up the Go FileAndDeleteCount to drop unnecessary runtime
guards against zero constants — NeedleMapEntrySize and NeedleIdSize
are compile-time non-zero.

test_ec_volume_journal updated to pre-populate the .ecx with the
needles it deletes, and extended to verify that repeat and
missing-id deletes do not drift the derived counts.

* ec: document enterprise-reserved proto field range on ec shard info

Both OSS master.proto copies now note that fields 10-19 are reserved
for future upstream additions while 20+ are owned by the enterprise
fork. Enterprise already pins data_shards/parity_shards at 20/21, so
keeping OSS additions inside 8-19 avoids wire-level collisions for
mixed deployments.

* ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir

ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but
new() opens the actual files from ecx_actual_dir (which may fall back
to the data dir when the idx dir does not contain the index). After a
fallback, read_deleted_needles() and rebuild_ecx_from_journal() would
read/rebuild the wrong (nonexistent) path while heartbeats reported
counts from the file actually in use — silently dropping deletes.

Point idx_base_name() at ecx_actual_dir, which is initialized to
dir_idx and only diverges after a successful fallback, so every call
site agrees with the file new() has open. The pre-fallback call in
new() (line 142) still returns the dir_idx path because
ecx_actual_dir == dir_idx at that point.

Update the destroy() sweep to build the dir_idx cleanup paths
explicitly instead of leaning on the helpers, so post-fallback stale
files in the idx dir are still removed.

* ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure

Two EC delete-count correctness fixes applied symmetrically to Go and
Rust volume servers.

1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after
   recreating the empty journal, matching the on-disk truth.
   Previously the cached size still reflected the pre-rebuild journal
   and file_and_delete_count() would keep reporting stale delete
   counts. The Go side has no equivalent bug because RebuildEcxFile
   runs in an offline helper that does not touch an EcVolume struct.

2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx
   entry before writing the .ecj record. If the .ecj append then
   failed, the needle was permanently marked deleted but the
   heartbeat-reported delete_count never advanced (it is derived from
   .ecj file size), and a retry would see AlreadyDeleted and early-
   return, leaving the drift permanent.

   Both languages now capture the entry's file offset and original
   size bytes during the mark step, attempt the .ecj append, and on
   failure roll the .ecx tombstone back by writing the original size
   bytes at the known offset. A rollback that itself errors is
   logged (glog / tracing) but cannot re-sync the files — this is
   the same failure mode a double disk error would produce, and is
   unavoidable without a full on-disk transaction log.

Go: wrap MarkNeedleDeleted in a closure that captures the file
offset into an outer variable, then pass the offset + oldSize to the
new rollbackEcxTombstone helper on .ecj seek/write errors.

Rust: DeleteOutcome::Tombstoned now carries the size_offset and a
[u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete
destructures on Tombstoned and calls restore_ecx_size on .ecj append
failure.

* test(ec): widen admin /health wait to 180s for cold CI

TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and
admin in sequence, then waited only 60s for admin's HTTP server to come
up. On cold GitHub runners the tail of the earlier subprocess startups
eats most of that budget and the wait occasionally times out (last hit
on run 24374773031). The local fast path is still ~20s total, so the
bump only extends the timeout ceiling, not the happy path.

* test(ec): fork volume servers in parallel in TestEcEndToEnd

startWeed is non-blocking (just cmd.Start()), so the per-process fork +
mkdir + log-file-open overhead for 14 volume servers was serialized for
no reason. On cold CI disks that overhead stacks up and eats into the
subsequent admin /health wait, which is how run 24374773031 flaked.

Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds
with a mutex so concurrent appends are safe. startWeed still calls
t.Fatalf on failure, which is fine from a goroutine for a fatal test
abort; the fail-fast isn't something we rely on for precise ordering.

* ec: fsync ecx before ecj, truncate on failure, harden rebuild

Four correctness fixes covering both volume servers.

1. Durability ordering (Go + Rust). After marking the .ecx tombstone
   we now fsync .ecx before touching .ecj, so a crash between the two
   files cannot leave the journal with an entry for a needle whose
   tombstone is still sitting in page cache. Once the fsync returns,
   the tombstone is the source of truth: reads see "deleted",
   delete_count may under-count by one (benign, idempotent retries)
   but never over-reports. If the fsync itself fails we restore the
   original size bytes and surface the error. The .ecj append is then
   followed by its own Sync so the reported delete_count matches the
   on-disk journal once the write returns.

2. .ecj truncation on append failure. write_all may have extended the
   journal on disk before sync_all / Sync errors out, leaving the
   cached ecj_file_size out of sync with the physical length and
   drifting delete_count permanently after restart. Both languages
   now capture the pre-append size, truncate the file back via
   set_len / Truncate on any write or sync failure, and only then
   restore the .ecx tombstone. Truncation errors are logged — same-fd
   length resets cannot realistically fail — but cannot themselves
   re-sync the files.

3. Atomic rebuild_ecx_from_journal (Rust, dead code today but wired
   up on any future decode path). Previously a failed
   mark_needle_deleted_in_ecx call was swallowed with `let _ = ...`
   and the journal was still removed, silently losing tombstones.
   We now bubble up any non-NotFound error, fsync .ecx after the
   whole replay succeeds, and only then drop and recreate .ecj.
   NotFound is still ignored (expected race between delete and encode).

4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to
   return Ok(NotFound) when self.ecx_file was None, hiding a closed or
   corrupt volume behind what looks like an idempotent no-op. It now
   returns an io::Error carrying the volume id so callers (e.g.
   journal_delete) fail loudly instead.

Existing Go and Rust EC test suites stay green.

* ec: make .ecx immutable at runtime; track deletes in memory + .ecj

Refactors both volume servers so the sealed sorted .ecx index is never
mutated during normal operation. Runtime deletes are committed to the
.ecj deletion journal and tracked in an in-memory deleted-needle set;
read-path lookups consult that set to mask out deleted ids on top of
the immutable .ecx record. Mirrors the intended design on both Go and
Rust sides.

EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from
.ecj in NewEcVolume / EcVolume::new. DeleteNeedleFromEcx /
journal_delete:

  1. Looks the needle up read-only in .ecx.
  2. Missing needle -> no-op.
  3. Pre-existing .ecx tombstone (from a prior decode/rebuild) ->
     mirror into the in-memory set, no .ecj append.
  4. Otherwise append the id to .ecj, fsync, and only then publish
     the id into the set. A partial write is truncated back to the
     pre-append length so the on-disk journal and the in-memory set
     cannot drift.

FindNeedleFromEcx / find_needle_from_ecx now return
TombstoneFileSize when the id is in the in-memory set, even though
the bytes on disk still show the original size.

FileAndDeleteCount:
  fileCount   = .ecx size / NeedleMapEntrySize (unchanged)
  deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize)

The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers
still fold .ecj into .ecx — that is the one place tombstones land in
the physical index, and it runs offline on closed files. Rust's
rebuild helper now also clears the in-memory set when it succeeds.

Dead code removed on the Rust side: `DeleteOutcome`,
`mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the
runtime `rollbackEcxTombstone` path. Neither helper was needed once
.ecx stopped being a runtime mutation target.

TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten
as TestEcVolumeDeleteDurableToJournal, which exercises the full
durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks
via the in-memory set -> raw .ecx bytes are *unchanged* -> Close +
RebuildEcxFile folds the journal into .ecx -> raw bytes now show
the tombstone, as CopyFile in the decode path expects.
2026-04-13 21:10:36 -07:00
Chris Lu
64af80c78d fix(mount): stop double-applying umask in Mkdir (#9063)
Mkdir was masking in.Mode with wfs.option.Umask on top of the kernel's
VFS umask pass, so a caller with umask=0 who requested mkdir(0777) got
0755 (0777 & ~022). Create and Symlink don't apply this second pass —
Mkdir was the odd one out. The resulting dirs had fewer write bits than
the caller asked for, which broke cross-user rename permission checks
(kernel may_delete rejects with EACCES when the parent lacks o+w even
though the caller explicitly requested it) and blocked pjdfstest
tests/rename/21.t and its cascading checks.

Drop the extra umask so Mkdir trusts in.Mode exactly like Create. The
CLI -umask flag still covers the internal cache dirs that the mount
creates for itself via os.MkdirAll; only the user-facing Mkdir path
changes.

Unblocks tests/rename/21.t — full pjdfstest suite is now 236 files /
8819 tests, all PASS, and known_failures.txt is empty.
2026-04-13 19:34:30 -07:00
Chris Lu
c8433a19f0 fix(mount): propagate hard-link nlink changes to sibling cache entries (#9062)
* fix(mount): propagate hard-link nlink changes to sibling cache entries

weed mount serves stat from its local metacache, and the kernel also
caches inode attrs from FUSE replies. When a hard link was unlinked or
a new link added, the filer updated the shared HardLink blob correctly,
but the sibling link entries in the mount's metacache still carried the
stale HardLinkCounter and the kernel attr cache on the shared inode was
not invalidated. Subsequent lstat on any sibling link returned the old
nlink — pjdfstest link/00.t caught this after `unlink n0` and on
`link n1 n2` stating n0.

Walk every path bound to the hard-linked inode via a new
InodeToPath.GetAllPaths, rewrite each cached sibling's HardLinkCounter
and ctime to the authoritative new value, and call
fuseServer.InodeNotify to invalidate the kernel attr cache for the
shared inode. Applied from both Link (bump) and Unlink (decrement).

Unblocks tests/link/00.t and tests/unlink/00.t in pjdfstest; full suite
(235 files, 8803 tests) passes end-to-end with no regressions.

* fix(mount): harden hard-link sibling sync against nil Attributes and id mismatch

Review follow-ups:
- Unlink: guard entry.Attributes for nil before reading Inode, with a
  fallback to inodeToPath.GetInode resolved before RemovePath. Fold the
  duplicated RemovePath into a single call.
- syncHardLinkSiblings: skip siblings whose HardLinkId does not match
  the authoritative entry. The shared-inode invariant normally
  guarantees a match, but a transient mismatch (e.g. a rename replaced
  one of the paths) would otherwise stamp an unrelated entry with the
  wrong counter.

Full pjdfstest suite still passes (235 files, 8803 tests).
2026-04-13 18:40:57 -07:00
Chris Lu
f00cbe4a6d test(vacuum): fix flaky TestVacuumIntegration across multiple volumes (#9061)
* test(vacuum): fix flaky TestVacuumIntegration across multiple volumes

The test assumed all uploaded files landed in a single volume and
tracked only the last file's volume id. With -volumeSizeLimitMB 10
and 16x500KB files, the master can spread uploads across volumes,
so the tracked id could point to a volume with no deletes and thus
0% garbage — causing verify_garbage_before_vacuum to fail even
though vacuum ran correctly on the other volume.

Track the set of volumes where deletes actually occurred and
verify garbage/cleanup against all of them. Also add a short
retry loop on the pre-vacuum check to absorb heartbeat jitter.

* test(vacuum): require all dirty volumes ready; retry cleanup check

Address review feedback: the pre-vacuum check now waits until every
volume in dirtyVolumes reports garbage > threshold (not just the
first), and the post-vacuum cleanup check retries per-volume with a
deadline instead of relying on a fixed sleep, since vacuum + heartbeat
reporting is asynchronous.

* test(vacuum): deterministic dirty volumes order, aggregate cleanup failures

- Sort dirtyVolumes after building from the set so logs and iteration
  are stable across runs.
- In verify_cleanup_after_vacuum, track per-volume failure reasons in a
  map and report all still-failing volumes on timeout instead of only
  the last one that happened to be written to lastErr.
2026-04-13 17:56:32 -07:00
dependabot[bot]
8d6c5cbb58 build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest/tools (#9056)
build(deps): bump org.apache.kafka:kafka-clients

Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2.

---
updated-dependencies:
- dependency-name: org.apache.kafka:kafka-clients
  dependency-version: 3.9.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 13:05:09 -07:00
Chris Lu
10e7f0f2bc fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy

Instead of erroring when the user already exists, the command now
creates the policy and attaches it to the existing user via UpdateUser.
Credentials are only generated and displayed for newly created users.

* fix(shell): skip duplicate policy attachment in s3.user.provision

Check if the policy is already attached before appending and calling
UpdateUser, making repeated runs idempotent.

* fix(shell): generate service account ID in s3.serviceaccount.create

The command built a ServiceAccount proto without setting Id, which was
rejected by credential.ValidateServiceAccountId on any real store. Now
generates sa:<parent>:<uuid> matching the format used by the admin UI.

* test(s3): integration tests for s3.* shell commands

Adds TestShell* integration tests covering ~40 previously untested
shell commands: user, accesskey, group, serviceaccount, anonymous,
bucket, policy.attach/detach, config.show, and iam.export/import.

Switches the test cluster's credential store from memory to filer_etc
because the memory store silently drops groups and service accounts
in LoadConfiguration/SaveConfiguration.

* fix(shell): rollback policy on key generation failure in s3.user.provision

If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after
the policy was persisted, the policy would be left orphaned. Extracts
the rollback logic into a local closure and invokes it on all failure
paths after policy creation for consistency.

* address PR review feedback for s3 shell tests and serviceaccount

- s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for
  the service account UUID instead of 4 bytes to eliminate collision risk
- s3.serviceaccount.create: print the actual ID and drop the outdated
  "server-assigned" note (the ID is now client-generated)
- tests: guard createdAK in accesskey rotate/delete subtests so sibling
  failures don't run invalid CLI calls
- tests: requireContains/requireNotContains use t.Fatalf to fail fast
- tests: Provision subtest asserts the "Attached policy" message on the
  second provision call for an existing user
- tests: update extractServiceAccountID comment example to match the
  sa:<parent>:<uuid> format
- tests: drop redundant saID empty-check (extractServiceAccountID fatals)

* test(s3): use t.Fatalf for precondition check in serviceaccount test
2026-04-11 22:30:51 -07:00