13677 Commits

Author SHA1 Message Date
Chris Lu
73fc9e3833 4.23 4.23 2026-05-03 23:15:34 -07:00
Chris Lu
d605feb403 refactor(command): expand "~" in all path-style CLI flags (#9306)
* refactor(command): expand "~" in all path-style CLI flags

Many of weed's path-bearing flags (-s3.config, -s3.iam.config,
-admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key
files, profile output paths, mount cache dirs, sftp key files, ...)
were never run through util.ResolvePath, so a value like "~/iam.json"
was used literally. Tilde only worked when the shell expanded it,
which silently fails for the common -flag=~/path form (bash leaves
the tilde literal in --opt=~/path).

- Extend util.ResolvePath to also handle "~user" / "~user/rest",
  matching shell tilde expansion. Add unit tests.
- Apply util.ResolvePath at the top of each shared start* function
  (s3, webdav, sftp) so mini/server/filer/standalone callers all
  inherit it; resolve at the few one-off use sites (mount cache
  dirs, volume idx folder, mini admin.dataDir, profile paths).
- Drop the duplicate expandHomeDir helper from admin.go in favor of
  the now-equivalent util.ResolvePath.

* fixup: handle comma-separated -dir flags for tilde expansion

`weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept
comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on
the whole string mishandled multi-folder values with tilde, e.g.
"~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath.

- Add util.ResolveCommaSeparatedPaths: split on ",", run each entry
  through ResolvePath, rejoin. Short-circuits when no "~" present.
- Use it for *miniDataFolders (mini.go), *volumeDataFolders (server.go),
  and resolve each entry of v.folders in-place (volume.go) so all
  downstream consumers see resolved paths.
- Add 7-case TestResolveCommaSeparatedPaths covering empty, single,
  multiple, and mixed inputs.

* address PR review: metaFolder + Windows backslash

- master.go: resolve *m.metaFolder at the top of runMaster so
  util.FullPath(*m.metaFolder) on the next line sees an expanded
  path. Drop the now-redundant ResolvePath in TestFolderWritable.
- server.go: same treatment for *masterOptions.metaFolder, paired
  with the existing cpu/mem profile resolves. Drop the redundant
  inner ResolvePath at TestFolderWritable.
- file_util.go: ResolvePath now accepts filepath.Separator as a
  separator after the tilde, so "~\\data" works on Windows. Other
  platforms keep current behaviour (backslash stays literal because
  it is a valid filename character in usernames and paths).
- file_util_test.go: add two cases using filepath.Separator that
  exercise the new code path on Windows and remain a no-op on Unix.

* address PR review: resolve "~" in remaining command path flags

Comprehensive sweep of path-bearing flags across every weed
subcommand, applying util.ResolvePath in-place at the top of each
run* function so all downstream consumers see expanded paths.

- webdav.go: resolve *wo.cacheDir at the top of startWebDav so
  mini/server/filer/standalone callers all inherit it.
- mount_std.go: cpu/mem profile paths.
- filer_sync.go: cpu/mem profile paths.
- mq_broker.go: cpu/mem profile paths.
- benchmark.go: cpuprofile output path.
- backup.go: -dir resolved once at runBackup; drop the duplicated
  inline ResolvePath in NewVolume calls.
- compact.go: -dir resolved at runCompact; drop inline ResolvePath.
- export.go: -dir and -o resolved at runExport; drop inline
  ResolvePath in LoadFromIdx and ScanVolumeFile.
- download.go: -dir resolved at runDownload; drop inline.
- update.go: -dir resolved at runUpdate so filepath.Join uses the
  expanded path; drop inline ResolvePath in TestFolderWritable.
- scaffold.go: -output expanded before filepath.Join.
- worker.go: -workingDir expanded before being passed to runtime.

* address PR review: resolve option-struct paths at run* entry points

server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile
*before* startS3Server runs, which meant the filer-side code saw the
unresolved tilde-prefixed pointer. Same pattern for webdavOptions and
sftpOptions (and equivalent in mini.go / filer.go).

The fix: hoist resolution from the shared start* functions up to the
run* entry points, where every shared pointer is set up before any
propagation happens.

- s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each
  Options struct that runs every path field through util.ResolvePath
  in-place. Idempotent.
- runS3, runWebDav, runSftp: call the standalone struct's resolvePaths
  before starting metrics / loading security config.
- runServer, runMini, runFiler: call resolvePaths on every embedded
  options struct, plus resolve loose flags (serverIamConfig,
  miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and
  filer's defaultLevelDbDirectory) so they're expanded before any
  pointer copy or use.
- Drop the now-redundant inline ResolvePath at filer's
  defaultLevelDbDirectory composition.

* address PR review: re-resolve mini -dir post-config, cover misc paths

- mini.go: applyConfigFileOptions can overwrite -dir with a literal
  ~/data from mini.options. Re-resolve *miniDataFolders after the
  config-file apply, alongside the other path resolves, so the mini
  filer no longer ends up with a literal ~/data/filerldb2.
- benchmark.go: resolve *b.idListFile (-list).
- filer_sync.go: resolve *syncOptions.aSecurity / .bSecurity
  (-a.security / -b.security) before LoadClientTLSFromFile.
- filer_cat.go: resolve *filerCat.output (-o) before os.OpenFile.
- admin.go: drop trailing blank line at EOF (git diff --check).

* address PR review: resolve -a.security/-b.security/-config before use

Three follow-up fixes:

- filer_sync.go: the -a.security / -b.security resolves were placed
  *after* LoadClientTLSFromFile / LoadHTTPClientFromFile were called,
  so weed filer.sync -a.security=~/a.toml still passed the literal
  tilde path. Hoist the resolves above the security-loading block so
  TLS clients see expanded paths.
- filer_sync_verify.go: same flag pair was never resolved at all in
  the verify command; resolve at the top of runFilerSyncVerify.
- filer_meta_backup.go: -config (the backup_filer.toml path) was
  passed directly to viper. Resolve at the top of runFilerMetaBackup.
- mini.go: master.dir defaulted to the entire comma-joined
  miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir
  setup), TestFolderWritable then stat'd the joined string instead
  of a single directory. Default to the first entry via StringSplit
  to mirror the disk-space calculation a few lines below, and drop
  the now-redundant ResolvePath in TestFolderWritable.
2026-05-03 21:46:21 -07:00
Chris Lu
87bb0a4115 test(s3tables): capture weed mini stdout/stderr in catalog_spark tests
Without this, when "weed mini" fails to bind its master port within the
45s startup timeout the only diagnostic is the timeout message itself,
making flakes impossible to root-cause. The sibling catalog, catalog_dremio,
and catalog_trino packages already wire stdout/stderr through; bring
catalog_spark in line.
2026-05-03 21:39:42 -07:00
Chris Lu
6844ec067c fix(s3): cache remote-only source before CopyObject (#9304) (#9305)
* fix(s3): cache remote-only source before CopyObject (#9304)

CopyObject from a remote.mount source whose object lives only upstream
created a destination entry with FileSize > 0 but no chunks/content,
because the resolved source entry has no local chunks and the copy path
fell into the "inline/empty chunks" branch with empty entry.Content.
A subsequent GET returned 500 with "data integrity error: size N
reported but no content available". CopyObjectPart had the same shape
via copyChunksForRange iterating an empty chunk list.

Detect entry.IsInRemoteOnly() right after resolving the source in both
CopyObjectHandler and CopyObjectPartHandler and cache the object to the
local cluster first via a new cacheRemoteObjectForCopy helper (a
copy-time analogue of cacheRemoteObjectForStreaming with a bounded 30s
timeout and version-aware path resolution). If caching fails or
produces no chunks, return 503 with Retry-After: 5 instead of writing
a metadata-only destination, mirroring the GetObject behavior added in
the #7817 cold-cache fix.

Adds TestCopyObjectRemoteOnlySourceDetection pinning the four entry
shapes the fix branches on plus the pre-fix broken-output shape.

* address PR review on remote-only copy fix

- Use the resolved entry's version id when srcVersionId is empty so a
  CopyObject reading the latest object in a versioning-enabled bucket
  caches the correct .versions/v_<id> path instead of getting stuck in
  a 503 retry loop. New helper resolvedSourceVersionId handles the
  fallback for both CopyObject and CopyObjectPart.
- Drop the redundant cachedEntry.IsInRemoteOnly() recheck in both
  handlers; the cache helper now reports success based on local data
  presence, and IsInRemoteOnly does not look at inline Content so
  keeping the check would 503 on small inline-cached objects.
- Treat inline Content as a successful cache result in both
  cacheRemoteObjectForStreaming and cacheRemoteObjectForCopy via a
  shared cachedEntryHasLocalData predicate. The CopyObject inline
  branch already handles entries that have Content but no chunks.
- Extract buildVersionedRemoteObjectPath so the streaming and copy
  cache helpers share path construction.

Adds TestResolvedSourceVersionId and TestCachedEntryHasLocalData to
pin the new helpers' contracts.

* narrow streaming cache contract back to chunks-only

CodeRabbit flagged that cacheRemoteObjectForStreaming's caller in
streamFromVolumeServers (lines 997-1002) still required non-empty
chunks, so content-only cache hits would fall through to a 503 retry
loop instead of being honored.

Resolve by keeping the helper's contract chunks-only: the filer's
caching code only ever writes chunks, the streaming downstream isn't
wired to read inline Content from a cached entry, and a partial range-
aware inline writer here would be overkill for a path that doesn't
actually occur in practice. cacheRemoteObjectForCopy keeps the relaxed
contract since the copy path's inline branch genuinely handles both
chunked and content-only entries.

Document the asymmetry on cachedEntryHasLocalData and on
cacheRemoteObjectForStreaming so a future reader can see why the two
helpers diverge.

* extend version-id resolution to streaming cache path

CodeRabbit flagged that GetObjectHandler still passed the raw query
versionId to cacheRemoteObjectForStreaming. For latest-version reads
in versioning-enabled buckets that stays empty even though the
resolved entry lives at .versions/v_<id>, so remote-only GETs would
keep caching the wrong path and 503-ing forever. Reuse the new
resolvedSourceVersionId helper at the streaming call site too.

Also document on cachedEntryHasLocalData that the zero-byte case
flagged in the same review is handled upstream (IsInRemoteOnly
requires RemoteSize > 0, so the cache helper is never invoked for
empty remote objects -- CopyObject's pre-existing inline branch
writes a correct empty destination directly). Pin this with a new
test case.

* trim verbose comments

Drop tutorial-style and review-history comments. Keep only the WHY
that isn't obvious from identifiers: the #9304 reference on the new
branches in CopyObject / CopyObjectPart, the latest-version-fallback
rationale on resolvedSourceVersionId, and the streaming/copy contract
asymmetry on cachedEntryHasLocalData.

* drop issue references from comments

Issue numbers belong in PR descriptions and commit messages, not in
source comments where they rot. Replace with the underlying invariant
the code is preserving.

* test: drive remote-object cache helpers through real gRPC

Existing tests only re-enacted helper-function branching in test
space, so they could not have caught a handler that consumed a
remote-only entry without going through the cache. Stand up an
in-process filer gRPC server (UnimplementedSeaweedFilerServer +
configurable CacheRemoteObjectToLocalCluster response) and exercise
the two cache helpers end-to-end.

What's pinned:
- cacheRemoteObjectForCopy returns nil when the cache makes no
  progress (response is still remote-only), lets gRPC errors
  through as nil, accepts both chunked and inline-content cache
  hits, and surfaces deadline-exceeded as nil so callers can 503
  instead of holding the request open.
- Versioned source paths route to .versions/v_<id>; non-versioned
  and "null" stay at the bucket-relative path. Captured by reading
  the request the stub server received.
- cacheRemoteObjectForStreaming holds the stricter chunks-only
  contract: a content-only cache hit is not propagated, since
  streamFromVolumeServers' downstream isn't wired to read from
  inline Content there.

Any current or future handler that calls these helpers exercises
the same gRPC path under test, so the bug class is closed for
helper-routed cache calls.

* move remote-only copy test into the integration suite

The previous gRPC-stub test in weed/s3api/ was integration-flavored
but stubbed; relocate the coverage to the existing two-server suite
under test/s3/remote_cache/, which already exercises the real
remote.mount + remote.uncache flow against a primary SeaweedFS plus
a secondary acting as remote storage.

The new test/s3/remote_cache/remote_cache_copy_test.go drives:
- TestRemoteCacheCopyObject: upload to primary, uncache (entry now
  remote-only), CopyObject to a new key, GET the destination. Pre-
  fix the GET returned 500 'data integrity error: size N reported
  but no content'; this pins the fixed behavior over real HTTP
  through the actual handler stack.
- TestRemoteCacheCopyObjectPart: same shape via multipart
  UploadPartCopy on a 6 MiB object split into two parts, exercising
  CopyObjectPartHandler's range-copy path.

Drop weed/s3api/s3api_remote_storage_grpc_test.go: the helper-level
classification tests in s3api_remote_storage_test.go still cover
the contract pieces (cachedEntryHasLocalData, resolvedSourceVersionId,
the remote-only entry shape), and the integration suite covers the
end-to-end behavior that those classifications enable.
2026-05-03 18:52:45 -07:00
Chris Lu
fc75f16c30 test(s3tables): expand Dremio Iceberg catalog test coverage (#9303)
* test(s3tables): expand Dremio Iceberg catalog test coverage

Restructure TestDremioIcebergCatalog into subtests and add three new
checks that go beyond a connectivity smoke test:

- ColumnProjection: SELECT id, label proves Dremio parsed the schema
  served by the SeaweedFS REST catalog (the previous SELECT COUNT(*)
  passed without exercising any column metadata).
- InformationSchemaColumns: verifies the table's columns are listed in
  Dremio's INFORMATION_SCHEMA.COLUMNS in the expected ordinal order.
- InformationSchemaTables: verifies the table is registered in
  INFORMATION_SCHEMA.TABLES.

All subtests share a single Dremio container startup, so total
runtime is unchanged.

* test(s3tables): exercise multi-level Iceberg namespaces from Dremio

Seed a 2-level Iceberg namespace (and a table inside it) via the REST
catalog before bootstrapping Dremio, then add a MultiLevelNamespace
subtest that scans the nested table by its dot-separated reference.

This relies on isRecursiveAllowedNamespaces=true (already set in the
Dremio source config) to surface the nested levels as folders. A
regression in either the SeaweedFS namespace path encoding (#8959-style)
or Dremio's recursive-namespace discovery would surface here.

Adds two helpers to keep the existing single-level call sites unchanged:

- createIcebergNamespaceLevels: namespace creation with []string levels
- createIcebergTableInLevels: table creation with []string levels and
  unit-separator (0x1F) URL encoding for the namespace path component

* test(s3tables): verify Dremio reads PyIceberg-written rows

The previous Dremio subtests only scanned empty tables, so they did not
exercise the data path - just the catalog/metadata path. Add a
PyIceberg-based writer that materializes parquet files plus a snapshot
on a separate table before Dremio bootstraps, and two new subtests:

- ReadWrittenDataCount: SELECT COUNT(*) returns 3.
- ReadWrittenDataValues: SELECT id, label ORDER BY id returns the three
  written rows with the expected (id, label) pairs.

The writer runs in a small image (Dockerfile.writer) built locally on
demand. It pip-installs pyiceberg+pyarrow once and reuses the layer
cache on subsequent runs. The CI workflow pre-pulls python:3.11-slim
to keep cold runs predictable.

The writer authenticates via the OAuth2 client_credentials flow that
SeaweedFS already exposes at /v1/oauth/tokens, mirroring the Go-side
helper used for REST-API table creation.

* test(s3tables): fix Dremio writer required-field schema mismatch

PyIceberg's append() compatibility check rejects an arrow column whose
nullability does not match the Iceberg field. The table schema declares
id as `required long`, but the default pyarrow int64 column is nullable
- so the writer failed with:

    1: id: required long  vs.  1: id: optional long

Declare an explicit pyarrow schema with nullable=False on id and
nullable=True on label to match the Iceberg side.
2026-05-03 00:17:16 -07:00
Chris Lu
f16353de0b feat(mini): add -bucket flag to pre-create an S3 bucket on startup (#9302)
* feat(mini): add -bucket flag to pre-create an S3 bucket on startup

Lets users hand a pre-provisioned object store to clients/CI without a
post-start `weed shell s3.bucket.create` step. The flag is a no-op when
empty (default) and idempotent on subsequent starts.

* mini: bound bucket-creation RPCs with a timeout off miniClientsCtx

Address PR review feedback: derive the lookup/mkdir context from
miniClientsCtx() so Ctrl+C cancels the bucket RPCs, and cap with a 5s
timeout so a stalled filer cannot block the welcome message
indefinitely. Also wrap the DoMkdir error for parity with the lookup
path.

* mini: fall back to S3_BUCKET env var for -bucket

Mirrors the existing -s3.externalUrl / S3_EXTERNAL_URL pattern so
container/Kubernetes deployments can pre-create the bucket via env
without overriding the entrypoint command.

* docs(readme): lead weed mini quick start with credentials + bucket

Promote the one-line setup (env vars + bucket) so users get a
ready-to-use S3 endpoint without hopping between sections to find
credential and bucket setup.

* mini: accept comma-separated -bucket list

Lets a single startup pre-create multiple S3 buckets, e.g.
-bucket=bucket1,bucket2 (or S3_BUCKET=bucket1,bucket2). Names are
trimmed and deduped; per-bucket errors are logged and the loop continues
so one bad name does not block the rest.

* mini: add -tableBucket flag for pre-creating S3 Tables buckets

Mirrors -bucket but creates S3 Tables (Iceberg) buckets via
s3tables.Manager so users can hand the all-in-one binary a ready-to-use
table catalog without a follow-up weed shell call. Comma-separated, env
fallback to S3_TABLE_BUCKET, idempotent on restart, owned by the
DefaultAccountID placeholder.

* mini: use errors.Is for ErrNotFound check in bucket lookup

Matches the rest of the codebase (~20 call sites in weed/s3api). The
direct equality works today because LookupEntry returns ErrNotFound
unwrapped, but errors.Is future-proofs against any future wrapping.
2026-05-02 21:02:21 -07:00
Chris Lu
1f6f473995 refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
2026-05-02 18:03:13 -07:00
Chris Lu
b2f4ebb776 test(s3tables): add Dremio Iceberg catalog integration tests (#9299)
* test(s3tables): add Dremio Iceberg catalog integration tests

Add comprehensive integration tests for Dremio with SeaweedFS's Iceberg
REST Catalog, following the same patterns as existing Spark and Trino tests.

Tests include:
- Basic catalog connectivity and schema operations
- Table creation, insertion, and querying (CRUD)
- Deterministic table location specification
- Multi-level namespace support

Implementation includes:
- dremio_catalog_test.go: Core test environment and basic operations
- dremio_crud_operations_test.go: Schema and table CRUD testing
- dremio_deterministic_location_test.go: Location and namespace testing
- Comprehensive README and implementation documentation

CI/CD:
- Added dremio-iceberg-catalog-tests job to s3-tables-tests.yml
- Pre-pulls Dremio image, runs with 25m timeout
- Uploads artifacts on failure

* add docstrings to Dremio integration tests and fix CI image pre-pull

- Add function docstrings to all test functions and helper functions
  in dremio_catalog_test.go, dremio_crud_operations_test.go, and
  dremio_deterministic_location_test.go to improve code documentation
  and satisfy CodeRabbit's docstring coverage requirements.

- Make Dremio Docker image pre-pull non-critical in CI workflow.
  The pre-pull was failing with access denied error, but the image
  can still be pulled at runtime. Using continue-on-error to allow
  tests to proceed.

* fix: correct YAML syntax in Dremio CI workflow

Use multi-line run command with pipe operator (|) instead of
inline command with || operator to avoid YAML parsing errors.
The || operator was causing 'mapping values are not allowed here'
syntax errors in the YAML parser.

* make Dremio tests gracefully skip if container unavailable

Modify startDremioContainer and waitForDremio to return boolean values
instead of fataling. Tests now skip gracefully if:
- Dremio Docker image is unavailable
- Container fails to start
- Container doesn't become ready within timeout

This prevents CI failure when Dremio image is not accessible while
still testing the integration when it is available.

* Revert "make Dremio tests gracefully skip if container unavailable"

This reverts commit e4b43e1447.

* ci(s3-tables-tests): remove Dremio job from automated CI

The Dremio integration tests are designed to test Dremio's integration
with SeaweedFS Iceberg catalog and should not gracefully skip. Since
the Dremio Docker image is not available in GitHub Actions environments,
running the tests in CI would cause failures.

The Dremio integration test code is preserved in test/s3tables/catalog_dremio/
for local development and testing. This can be run locally where Docker
and the Dremio image are available.

* test(s3tables): remove hardcoded port mapping from Dremio container

The tests use docker exec to communicate with Dremio, not direct port bindings.
Removing the fixed 9047:9047 port mapping prevents port conflicts and is unnecessary
for the test infrastructure.

* test(s3tables): fix runDremioSQL helper with proper JSON encoding and error handling

- Use encoding/json to properly marshal SQL payload instead of using unescaped %q
- Change t.Logf to t.Fatalf to properly fail tests on Dremio command errors
- Remove dependency on jq for response parsing
- Add parseDremioResponse helper to decode and validate Dremio JSON responses
- Fail tests immediately if response contains error messages

* test(s3tables): add proper assertions to CRUD operation tests

- Change test assertions from t.Logf to t.Fatalf for schema listing and table listing
- Validate COUNT(*) results in TestTableCRUD with proper row parsing and count validation
- Add assertions for COUNT, WHERE clause, and GROUP BY aggregation queries in TestDataInsertAndQuery
- Use parseDremioResponse to properly decode JSON responses instead of ignoring results

* test(s3tables): fix deterministic location tests and multi-level namespace handling

- Add assertions to verify table row counts in TestDeterministicTableLocation using parseDremioResponse
- Add DESCRIBE FORMATTED query to verify table location matches explicit S3 path
- Fix multi-level namespace handling to use separate quoted identifiers per level
- Change logging assertions to fatal assertions for proper test failure reporting
- Validate row counts in TestMultiLevelNamespace instead of just logging

* test(s3tables): fix shell quoting in runDremioSQL for proper JSON payload handling

Use %q for proper shell escaping of JSON payload to handle SQL statements
that contain single quotes or other special characters without breaking the command.

* fix(s3tables): correct master address format in weed shell command

Use colon separator between master port and gRPC port instead of dot.
The format should be host:port:grpcport, not host:port.grpcport

* fix(s3tables): pin Dremio Docker image to specific version

Use dremio/dremio:25.0.0 instead of latest to ensure test stability
and reproducibility across different environments and time periods

* chore: remove accidentally committed lock file

Remove .claude/scheduled_tasks.lock which is a local runtime artifact
that should not be version controlled

* docs: remove IMPLEMENTATION.md from Dremio test suite

The implementation details are better documented in README.md and inline code comments

* ci(s3-tables-tests): add Dremio Iceberg catalog integration test job

Add dremio-iceberg-catalog-tests job to the S3 Tables Integration Tests workflow.
The job runs the Dremio integration tests that validate SeaweedFS's Iceberg REST
Catalog integration with the Dremio SQL engine. Tests will skip gracefully if
Docker or the Dremio image is unavailable.

* fix(ci): increase Dremio test timeout to 30 minutes

* fix(s3tables): use stdin for Dremio SQL payload instead of shell wrapping

* test(s3tables): validate aggregation sums in CRUD test

* docs: add package documentation for Dremio catalog tests

* ci: trigger workflow run

* ci: remove concurrency lock to allow workflow execution

* ci: trigger fresh workflow run

* ci: add push trigger to workflow for better scheduling

* ci: create dedicated Dremio workflow to isolate test execution

* ci: add minimal test to debug workflow execution

* ci: fix workflow triggers to use pull_request only

* ci: remove diagnostic test workflow

* fix(s3tables): replace weed shell bucket create with REST API in Dremio tests

The Dremio integration test invoked `weed shell` with a malformed
`-master=host:port:grpcPort` flag (the correct separator is `.`), causing
the shell to spend ~65s reconnecting and never cleanly exit after running
`s3tables.bucket -create`. The shell process hung until the 30m test
deadline killed the suite. Switch to the same direct PUT /buckets call
the other catalog tests use, removing the shell dependency entirely.

* fix(s3tables): use weed shell with correct master format for Dremio bucket create

The HTTP /buckets endpoint requires SigV4 auth because the Dremio test
runs the S3 server with `-s3.config <iam>`. Switch back to `weed shell`
(which calls the filer directly over gRPC, bypassing S3 auth — same
approach as duckdb_oauth_test.go) and use the correct master flag
format `host:port.grpcPort` (dot, not colon). The earlier failure was
caused by passing `host:port:grpcPort`, which the shell silently
retried forever. Also wrap the command in a 60s context timeout so any
future hang surfaces a clear error instead of a 30m test deadline.

* fix(s3tables): switch Dremio container image to public dremio-oss tag

The previous image reference dremio/dremio:25.0.0 does not exist on
Docker Hub (the dremio/dremio repo is unlisted), so docker pull fails
with "pull access denied" both in the pre-pull step and at test runtime.
The public Dremio OSS image is dremio/dremio-oss and 25.0.0 is published
there. Update the test source and both workflow pre-pull commands.

* chore: gitignore .claude/ harness directory

* fix(s3tables): bind-mount Dremio dremio.conf as a single file, capture container logs

The previous run mounted the whole configDir over /opt/dremio/conf,
which masked Dremio's bundled log4j2.properties, dremio-env, and
distrib.conf. The JVM crashed within ~10s of starting, so every test
hit "container is not running". Mount only dremio.conf (read-only) so
the defaults stay in place.

Also include the container logs in waitForDremio's failure message —
without them the only signal was "container is not running", which
gave us no path forward when startup fails.

* fix(s3tables): drop invalid catalog.iceberg.* keys from dremio.conf

Dremio rejects catalog.iceberg.* properties at startup with
"Failure reading configuration file. The following properties were
invalid" — sources are not configured via dremio.conf, only via the
REST API at /apiv2/source. Write an empty {} so Dremio boots with
defaults; subsequent SQL queries will surface the missing-source
condition with a meaningful error instead of crashing the JVM.

A real Iceberg-source bringup (Dremio first-user bootstrap + POST
/apiv2/source) is the follow-up work to actually exercise queries.

* test(s3tables): skip Dremio catalog tests pending REST-API bootstrap

The Dremio container now starts cleanly, but every SQL call returns
HTTP 401 because the test never bootstraps an admin user, never logs
in to obtain a token, and never registers an Iceberg REST source. With
those three pieces missing the suite cannot exercise anything beyond
"can we boot a Dremio container" — useful coverage requires a proper
bootstrap helper.

Skip the seven test functions via a single requireDremioCatalogConfigured
helper that documents the missing bring-up steps, so CI is no longer red
on a half-built suite. The skip is loud (clear message in the helper)
so the follow-up work won't be forgotten.

* ci(s3tables): make Dremio tests opt-in

* test(s3tables): run Dremio catalog smoke test

* test(s3tables): align Dremio 25 catalog config
2026-05-02 11:31:27 -07:00
Chris Lu
31e5e0dee2 fix(mount): keep async flush when LockOwner has no POSIX locks (#9300)
FlushIn.LockOwner is populated by the kernel for any fd that may have
participated in locking, not only when locks were actually taken. The
previous Flush logic treated any non-zero LockOwner as a closing lock
holder and forced a synchronous flush, which silently disabled the
writebackCache async-flush path (introduced in #8727) for most
ordinary close() calls.

Consult the POSIX lock table before forcing sync: only owners that
currently hold a non-flock byte-range lock need the synchronous path
to coordinate with blocked SetLkw waiters. Other closes go async as
intended.
2026-05-01 19:51:27 -07:00
Chris Lu
f2c3bd7b77 fix(admin/view): define basePath in plugin IIFE scopes (#9298)
The plugin.templ and plugin_lane.templ components use basePath() in their IIFE
(Immediately Invoked Function Expression) scopes to handle subdirectory
deployments. However, basePath was not defined locally, causing "basePath is
not defined" errors when accessing plugin pages.

Added local basePath function definitions in both files, matching the pattern
from admin.js. This function checks window.__BASE_PATH__ (set by the layout
during page initialization) and prepends it to API paths.
2026-05-01 18:22:39 -07:00
Lars Lehtonen
913f98db10 refactor(weed/storage): log volume file removal failures (#9297) 2026-05-01 11:52:40 -07:00
Yoann Katchourine
1127672d10 fix(s3api,iamapi): avoid full SaveConfiguration when creating a single IAM user (#9261)
* fix(s3api,iamapi): avoid full SaveConfiguration when creating a single IAM user

When CreateUser was called, both the embedded IAM path (s3api_embedded_iam.go)
and the standalone IAM path (iamapi_management_handlers.go) would:
  1. Load the full config (all N users)
  2. Append the new user
  3. Call SaveConfiguration, which re-writes ALL N user files in the filer_etc
     store, triggering N file-change events and N reload cycles.

The fix replaces the full-save path with credentialManager.CreateUser, which
writes only the single new user's file. Set changed=false to skip the redundant
SaveConfiguration call, and add "CreateUser" to the reload-after-targeted-write
block so the in-memory config is refreshed.

Also adds a nil guard around iama.iam.GetCredentialManager() in DoActions to
avoid a nil-pointer panic in legacy test fixtures that leave iam unset.

Add SetCredentialManagerForTest to IdentityAccessManagement so tests can inject
an in-memory credential store without touching production code paths.

* test(s3api,iamapi): regression tests for CreateUser targeted-write fix

Add TestCreateUserDoesNotSaveAllUsers (iamapi) and
TestEmbeddedIamCreateUserDoesNotSaveAllUsers (s3api) to guard against the
regression where CreateUser would call SaveConfiguration and re-write all
N existing user files.

Both tests:
  - Pre-populate 3 existing users
  - Invoke CreateUser via the HTTP API
  - Assert PutS3ApiConfiguration (full-config save) was NOT called
  - Assert the new user is visible in the credential store
  - Assert all pre-existing users are still intact

Also update TestEmbeddedIamExecuteAction to verify persistence via the
credential manager directly (mockConfig is no longer updated on CreateUser
since we skip the SaveConfiguration path).

* refactor(s3api,iamapi): share credential-error to IAM-code mapping

Move the credentialErrToIamErrCode helper from weed/iamapi to
weed/s3api as exported CredentialErrToIamErrCode and call it from
both the standalone IAM handler (iamapi) and the embedded IAM
handler (s3api). Previously the standalone path used the helper
while the embedded path duplicated the same switch inline; the two
sites could drift out of sync.

Also extend the mapping to cover ErrUserNotFound and
ErrAccessKeyNotFound (404 NoSuchEntity) so non-CreateUser callers
that opt into the helper get the right HTTP status.

* test(s3api): seed credential store explicitly in CreateUser regression

Previously the test relied on getS3ApiConfigurationFunc's syncOnce
side effect to populate the credential store with mockConfig
identities. If that fixture ever stopped seeding, the post-test
"pre-existing users still exist" assertion would silently start
passing for the wrong reason (the users were never created).

Seed cm.SaveConfiguration directly and assert the seed precondition
before the API call so any seeding regression fails loudly.

* fix(s3api): honor skipPersist in embedded CreateUser targeted path

ExecuteAction documents that skipPersist=true means "the changed
configuration is not saved to the persistent store", but the
targeted credentialManager.CreateUser added for the avoid-bulk-save
fix ran unconditionally. Gate it on !skipPersist so no-persist
callers don't silently leak a write to the credential store. Leave
changed=true in the skipPersist branch so the tail's existing
`if changed { if !skipPersist { ... } }` block keeps suppressing
persistence the same way it does for every other action.

Add TestEmbeddedIamCreateUserSkipPersist to pin the contract: a
CreateUser invocation with skipPersist=true must not call
PutS3ApiConfiguration and must not leave the user in the
credential store.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-01 01:14:15 -07:00
Chris Lu
7428c48dd6 fix(master): bump seaweedfs/raft to v1.1.8 for Windows syncDir fix (#9296)
raft v1.1.7 added an fsync of the state directory after persisting
currentTerm/votedFor. On Windows the syscall (FlushFileBuffers) requires
GENERIC_WRITE, but os.Open opens directories read-only, so every call
panicked the master with ERROR_ACCESS_DENIED ("Zugriff verweigert").
This took down "weed mini" on every restart that loaded persisted state
and entered candidateLoop.

v1.1.8 splits syncDir into build-tagged files: unix keeps the durability
fsync; Windows is a no-op since NTFS journals rename metadata and there
is no portable equivalent of fsync(dir_fd) there.

Fixes #9292.
2026-04-30 21:53:49 -07:00
Chris Lu
6572b472c3 fix(s3): honor X-Forwarded-For in audit log remote_ip (#9295)
* fix(s3): honor X-Forwarded-For in audit log remote_ip

When SeaweedFS S3 sits behind a reverse proxy (e.g., Caddy), the audit
log's `remote_ip` was reporting the proxy's address because only
`X-Real-IP` and `r.RemoteAddr` were consulted. Caddy and most other
proxies set `X-Forwarded-For` by default but not `X-Real-IP`, so the
real client IP was lost.

Check `X-Forwarded-For` first (using the left-most non-empty entry as
the originating client), then fall back to `X-Real-IP`, then
`r.RemoteAddr`.

Fixes #9293

* fix(s3): strip port from RemoteAddr fallback in audit log

Address PR review: the X-Forwarded-For and X-Real-IP paths return
host-only values, while the RemoteAddr fallback was returning
"host:port", making the remote_ip field inconsistent with both the
other code paths and the "192.0.2.3" example in the AccessLog struct.
Use net.SplitHostPort to strip the port, falling back to the raw
RemoteAddr for non-IP markers (e.g., "@" for unix sockets).
2026-04-30 15:19:04 -07:00
dependabot[bot]
8596434938 build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 from 3.134.0 to 3.134.2 (#9294)
build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3

Bumps [github.com/ydb-platform/ydb-go-sdk/v3](https://github.com/ydb-platform/ydb-go-sdk) from 3.134.0 to 3.134.2.
- [Release notes](https://github.com/ydb-platform/ydb-go-sdk/releases)
- [Changelog](https://github.com/ydb-platform/ydb-go-sdk/blob/master/CHANGELOG.md)
- [Commits](https://github.com/ydb-platform/ydb-go-sdk/compare/v3.134.0...v3.134.2)

---
updated-dependencies:
- dependency-name: github.com/ydb-platform/ydb-go-sdk/v3
  dependency-version: 3.134.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-30 15:02:22 -07:00
Chris Lu
d265274e13 fix(nfs): accept dirpath any-where under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy

weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.

Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.

Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.

* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export

Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.

  - dirpath equals the export root: serve the export root.
  - dirpath strictly under the export, directory entry: serve that
    subdirectory; the returned filehandle encodes its inode.
  - dirpath strictly under the export, missing or non-directory: reject
    with NoEnt or NotDir.
  - dirpath outside the export: keep the rclone-style fallback to the
    export root.

TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.

The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.

* nfs: propagate request context through MOUNT3 resolution

Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.

lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().

Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.

* nfs: mirror TCP defensive checks on the UDP MOUNT3 path

Two transport-parity bugs the rabbit caught:

(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.

(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.

Lower the successful subexport log to V(1) to match the TCP side.

* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles

The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.

* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem

Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.

* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers

Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:

  - Both transports observe the same chunk reader cache and chunk
    invalidator without depending on call order during startup.
  - The UDP defensive Lstat doesn't allocate a fresh wrapper per
    MOUNT request anymore; one struct lives for the life of the
    Server.

The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.

* nfs: drive rootFilesystem before reading sharedReaderCache on UDP

The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.

Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.

* nfs: collapse exact-match and outside-export MOUNT branches

The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
2026-04-30 10:06:44 -07:00
Chris Lu
9b624a73fe ci: provide a Docker tag for foundationdb release container on workflow_dispatch
The metadata-action used type=ref,event=tag, which produces no tag on
workflow_dispatch, causing build-push to fail with
"tag is needed when pushing to registry". Add a release_tag input and
build the tag from a RELEASE_TAG env, mirroring container_release_unified.yml.
2026-04-29 23:15:33 -07:00
Chris Lu
14cd426cf9 templ 2026-04-29 16:08:38 -07:00
Lars Lehtonen
89b07f3e12 chore(weed/mq/kafka/protocol): prune no-op test (#9287) 2026-04-29 14:11:51 -07:00
Lars Lehtonen
02da928e25 chore(weed/mq/kafka/protocol): prune dead code (#9288) 2026-04-29 14:11:37 -07:00
Jaehoon Kim
be451d22b5 feat(filer.sync): add -verifySync mode to filer.sync for cross-cluster file comparison (#9284)
* Add -verifySync flag to filer.sync for cross-cluster file comparison

Add a verification mode to filer.sync that compares entries between two
filers without performing actual synchronization. Uses directory-level
sorted merge of ListEntries to detect missing files, size mismatches,
and ETag mismatches. Supports -isActivePassive for unidirectional check
and -modifyTimeAgo to skip recently modified files during sync lag.

* Add mtime annotation and JSON output to filer.sync -verifySync

Add automatic mtime relation analysis for SIZE_MISMATCH and
ETAG_MISMATCH diffs, and an NDJSON output mode for external tooling.

mtime classification:
- B_NEWER => "late_updates_skip_likely" hint. Surfaces the case
  where target has a stub entry whose mtime is ahead of source's
  real file, causing UpdateEntry's mtime guard in filersink to
  permanently skip the update.
- A_NEWER => "sync_lag_or_event_miss" hint.
- EQUAL   => no hint (chunk-level issue suspected).

Text output example:
  [SIZE_MISMATCH] /path (a=996, b=0, B newer +274d [late-updates skip likely])

Add -verifyJsonOutput flag. When set, emits one JSON object per
line (NDJSON) for diffs and a final SUMMARY object, suitable for
piping into external diagnostic pipelines.

Concurrent writes from the directory worker pool are now serialized
via outputMu to keep both text lines and JSON records atomic.

* fix(filer.sync): use shared global semaphore in verifySync to bound goroutine explosion

Replace the per-call local semaphore in compareDirectory with a single
shared semaphore created in runVerifySync. The old per-level semaphore
applied a limit of verifySyncConcurrency only within each directory level,
allowing effective concurrency to grow as verifySyncConcurrency^depth on
deep trees.

The shared semaphore is held only for each directory's I/O phase
(listEntries + merge) and released before recursing into subdirectories,
so a parent never blocks waiting for children to acquire slots — which
would deadlock once tree depth exceeds the semaphore capacity.

Extract the capacity into a named constant (verifySyncConcurrency = 5)
with a comment explaining the memory vs. performance trade-off.

Add unit tests:
- correctness: missing file, only-in-B, size mismatch, active-passive mode
- concurrency bound: peak concurrent listings ≤ verifySyncConcurrency
- no-deadlock: binary tree of depth 10 completes within timeout

* fix(filer.sync): stream directory entries to prevent OOM on large directories

Replace the listEntries helper (which accumulated all entries into a
single []filer_pb.Entry slice) with an entryStream type that pages
through the directory in the background and forwards entries one at a
time through a buffered channel. Memory per directory comparison is now
O(channel buffer size = 64) regardless of how many entries the directory
contains.

Key design points:
- entryStream wraps a goroutine + buffered channel with a one-entry
  lookahead (peek/advance) so the two-pointer sorted merge in
  compareDirectory can work without buffering any full listing.
- A child context (mergeCtx) is passed to both stream goroutines so
  they are cancelled promptly if compareDirectory returns early (e.g.
  on error); the ctx.Done() select arm in the callback prevents
  goroutine leaks when the consumer stops reading.
- stream.err is written by the goroutine before close(ch), so it is
  safe to read after the channel is exhausted (Go memory model:
  channel close happens-before the zero-value receive).
- countMissingRecursive is rewritten to use ReadDirAllEntries with a
  direct callback, eliminating its own slice allocation.
- listEntries is removed; it is no longer called anywhere.

* fix(filer.sync): address verifySync review findings

Four real bugs found and fixed; one finding already resolved (shared
semaphore was introduced in a prior commit).

path.Join for child paths (filer_sync_verify.go)
  fmt.Sprintf("%s/%s", dir, name) produced "//name" when dir was "/".
  Replace all child-path concatenations with path.Join so root-level
  walks emit clean paths.

cutoffTime check for ONLY_IN_B entries (filer_sync_verify.go)
  The B-only branch ignored -modifyTimeAgo, so files recently written
  to B were reported as ONLY_IN_B instead of being skipped. Mirror the
  A-side mtime guard: skip and increment skippedRecent when the entry
  is newer than cutoffTime.

Summary emitted before error check (filer_sync_verify.go)
  A filer I/O error mid-walk still caused a SUMMARY record (or text
  summary) to be printed, making partial runs appear complete. Move the
  error check to before summary emission; on error, return immediately
  without printing any summary.

Return false on verification failure (filer_sync.go)
  runVerifySync returned true (exit 0) even when diffs were found or the
  walk failed. Return false so the main binary sets exit status 1,
  consistent with how all other commands signal failure.

* test(filer.sync): add missing verifySync test coverage

Four new tests covering gaps identified during review:

TestVerifySyncETagMismatch
  Verifies that two files with identical size but different Md5 checksums
  are counted as etagMismatch (not sizeMismatch). Exercises the second
  branch of compareEntries that was previously untested.

TestVerifySyncCutoffTime (4 subtests)
  A-only recent  — recent file skipped (skippedRecent++), not MISSING
  A-only old     — old file reported as MISSING
  B-only recent  — recent file skipped (skippedRecent++), not ONLY_IN_B
  B-only old     — old file reported as ONLY_IN_B
  The B-only subtests specifically cover the cutoffTime fix added in the
  previous commit.

TestVerifySyncRootPath
  Regression for the path.Join fix: walks from "/" and verifies that the
  child directory is reached and compared correctly (the old Sprintf
  produced "//data" which would silently produce wrong results).
  Asserts dirCount=2 and fileCount=1 to confirm the full tree is walked.

* fix(filer.sync): use os.Exit(2) instead of return false on verify failure

return false triggered weed.go's error handler which printed the full
command usage — appropriate for invalid arguments, not for a completed
verification that found differences. Use os.Exit(2) consistent with
the existing pattern in filer_sync.go (lines 251, 293).

* refactor(filer.sync.verify): split verify into its own command

The verify mode is a one-shot batch operation with a fundamentally
different lifecycle from the long-running sync subscriber, and most of
filer.sync's flags (replication, metrics port, debug pprof, concurrency,
etc.) do not apply to it. Extract it into a sibling command alongside
filer.copy/filer.backup/filer.export rather than a flag mode on
filer.sync.

Also rename modifyTimeAgo to modifiedTimeAgo (grammatical) and drop the
verifyJsonOutput prefix to plain jsonOutput now that the verify context
is implicit in the command name.

* fix(filer.sync.verify): address review comments

- Bounded worker pool: cap subdirectory goroutines per level via a
  jobs channel and min(verifySyncConcurrency, len(subDirs)) workers
  instead of spawning one goroutine per child. Wide directories no
  longer park ~2KB per queued goroutine.

- Don't gate recursion on a directory's mtime: a fresh child write
  bumps the parent mtime, but older files inside should still be
  reported as missing. Always recurse for missing-in-B directories
  and apply the cutoff per-file inside countMissingRecursive.

- Apply -modifiedTimeAgo symmetrically: matched-name files now skip
  the comparison when EITHER side is recently modified, not just A.
  This restores lag tolerance when B was just rewritten.

Adds tests for both new behaviors and a shared isTooRecent helper.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-29 12:33:53 -07:00
chenshi
eebffd9df6 fix(storage): fix verifyDeletedNeedleIntegrity using wrong offset (#9273)
* fix(storage): fix verifyDeletedNeedleIntegrity using wrong offset

verifyDeletedNeedleIntegrity was ignoring the offset recorded in the
index file and instead always reading from fileSize-DiskSize(0), which
is only correct when the deleted needle happens to be the very last
entry in the .dat file.

When a deletion tombstone is written in the middle of the volume (e.g.
after subsequent writes), the function reads the wrong bytes, causing
the key-mismatch check to either silently pass (if the bytes happen to
form a valid needle with the same ID) or produce a spurious integrity
error.

Fix: accept the actual offset from the index entry and use
types.TombstoneFileSize as the size, mirroring the approach used by
verifyNeedleIntegrity for live needles.

* fix(storage): fix error message in doCheckAndFixVolumeData

The error message in the deleted-needle branch referenced
verifyNeedleIntegrity, but the function being called is
verifyDeletedNeedleIntegrity. This makes log output misleading when
a deleted-needle integrity check fails.

Suggested by gemini-code-assist review on PR #9273.

---------

Co-authored-by: chenshi5012 <chenshi5012@github.com>
2026-04-29 11:22:29 -07:00
baracudaz
db34e8b6fd feat(admin): prefer stored S3 Content-Type metadata over key-extension MIME inference (#9286)
* feat(admin): prefer stored filer mime in file browser and properties

* feat(mime): enhance MIME type registration and improve fallback logic

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: baracudaz <baracudaz@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
2026-04-29 10:21:03 -07:00
qzhello
60c76120fc fix(shell): use exact match for volume.balance -racks/-nodes filter (#9279)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
  substring of the user-supplied flag value (e.g. -racks=rack10 also matched
  rack1). Replace it with an exact-match set parsed from the comma-separated
  flag value, and add regression tests for both -racks and -nodes paths.

  Also fix a small typo in the "remote storage" error returned by
  maybeMoveOneVolume.

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
  substring of the user-supplied flag value (e.g. -racks=rack10 also matched
  rack1). Replace it with an exact-match set parsed from the comma-separated
  flag value, and add regression tests for both -racks and -nodes paths.

  Also fix a small typo in the "remote storage" error returned by
  maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
2026-04-29 10:19:16 -07:00
Chris Lu
1da091f798 ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up)

Six integration test packages had _test.go files but no GitHub workflow
running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy bug
(and the four cross-SSE copy bugs in #9281) ship undetected was an
instance of this same pattern. This change wires three of them into CI
and removes a fourth that was deadcode:

  test/multi_master/                    NEW workflow: multi-master-tests.yml
    - 3-node master raft cluster failover/recovery (5 tests, ~65s)
  test/testutil/                         (run alongside multi_master)
    - port-allocator regression test
  test/s3/etag/                          NEW workflow: s3-etag-acl-tests.yml
    - PutObject ETag format regression for #7768 (must be pure MD5 hex,
      not "<md5>-N" composite, for AWS Java SDK v2 compatibility)
  test/s3/acl/                           (same workflow as etag)
    - object-ACL behavior on versioned buckets

  test/s3/catalog_trino/                 DELETED (deadcode)
    - Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go
      from a 2024 commit that was never iterated, while the
      test/s3tables/ counterpart has been actively maintained (and IS
      in CI via s3-tables-tests.yml's trino-iceberg-catalog-tests job).

Both workflows trigger only on changes to relevant code paths and use the
existing simple "build weed → run go test" pattern (no per-test-dir
Makefile boilerplate). The S3 workflow starts a single `weed mini`
shared by etag and acl, which keeps the job under 2 minutes on a fresh
runner.

Two tests remain knowingly uncovered:

  test/s3/basic/  — order-dependent state across tests (TestListObjectV2
    expects a bucket created by an earlier test, etc.) and uses the
    deprecated aws-sdk-go v1. Treated as sample programs, not a
    regression suite. Fixing them is out of scope for this PR.

  test/s3/catalog_trino/  — see "DELETED" above.

Verified locally:
  - go test -v -timeout=8m ./test/multi_master/... ./test/testutil/...
    PASS  (5 multi_master + 1 testutil tests, 64s)
  - weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/...
    PASS  (8 etag + 5 acl tests, ~6s after server startup)

* ci: fix log-collector glob for multi-master tests (review feedback on #9283)

test/multi_master/cluster.go creates per-test temp dirs via
os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to
match that prefix. The previous version looked for MasterCluster* /
TestLeader* / TestTwoMasters* / TestAllMasters* which never matches —
the failure-artifact upload would have been empty on a real failure.

Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it
actually picks up the per-node master*.log files under
<baseDir>/logs/.

Found by coderabbitai review on PR #9283.
2026-04-29 10:06:59 -07:00
Chris Lu
1f515f9d02 fix(s3api): cross-SSE copy operations and bring them back into CI (#9281) (#9282)
* fix(s3api): cross-SSE copy operations and bring them back into CI (#9281)

Four cross-SSE copy tests were broken on master and excluded from CI with
the comment "pre-existing SSE-C issues":

  - TestSSECObjectCopyIntegration/Copy_SSE-C_to_SSE-C_with_different_key
  - TestSSEKMSObjectCopyIntegration/Copy_SSE-KMS_with_different_key
  - TestCrossSSECopy/SSE-S3_to_SSE-C
  - TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object

Each surfaced as a different symptom — 500 InternalError, CRC32 mismatch,
"unexpected EOF", MD5 mismatch — but they were all instances of the same
root pattern that #8908 hit on UploadPartCopy: copy paths writing
destination chunks tagged inconsistently with the bytes on disk, so
detectPrimarySSEType / IsSSE*Encrypted disagreed about what the read
path should do.

Five fixes in this PR, each with its own targeted test:

  1. SSE-C IV format: putToFiler stored entry.Extended[SeaweedFSSSEIV]
     as raw bytes (with a comment saying so), but StoreSSECIVInMetadata
     stored it base64-encoded. The two readers (the GET handler reading
     it raw, and GetSSECIVFromMetadata reading it base64-decoded) each
     matched one writer but not the other. Standardise on raw bytes
     everywhere; GetSSECIVFromMetadata accepts the legacy base64 form
     for backward compat.

  2. SSE-C single-part copy chunk tagging: copyChunkWithReencryption
     re-encrypted the bytes for the destination but never set the
     destination chunk's SseType / SseMetadata. With chunks left
     SseType=NONE, detectPrimarySSEType returned "None" and the GET
     served still-encrypted volume bytes raw without decryption.
     Tag the chunk after re-encryption.

  3. SSE-KMS single-part copy chunk tagging: same shape as (2). Also,
     the function discarded the destSSEKey returned from
     CreateSSEKMSEncryptedReaderWithBucketKey (with `_`) — that key
     carries the freshly-minted EncryptedDataKey + IV the read path
     needs, so it must be captured and serialized into the destination
     chunk's per-chunk metadata (and bubbled up to the entry-level
     SeaweedFSSSEKMSKey for single-chunk objects whose read path falls
     back to the entry-level key).

  4. SSE-KMS multipart source decryption: copyChunkWithSSEKMSReencryption
     decrypted every source chunk with the entry-level sourceSSEKey.
     For multipart SSE-KMS objects each chunk has its own EDK + IV in
     per-chunk metadata, so the entry-level key is wrong. Decrypt with
     per-chunk metadata when present.

  5. Same-key copy fast path chunk tagging: copySingleChunk uses
     createDestinationChunk which dropped SseType / SseMetadata. For
     same-key copies (e.g. SSE-KMS source → SSE-KMS dest with the same
     KMS key) the fast path reuses the source ciphertext as-is, so the
     destination chunks must keep the source's SSE tagging. Add a
     createDestinationChunkPreservingSSE helper for the fast path; the
     re-encryption paths still call createDestinationChunk and then
     overwrite the SSE fields after re-encrypting.

CI: extend the comprehensive-test TEST_PATTERN to include the four test
families that were previously excluded (`.*ObjectCopyIntegration`,
`TestCrossSSECopy`, `TestSSEMultipartCopy`) so this category of
regression is caught going forward. The exclusion comment is removed.

Tests:
  - All four originally-failing tests pass.
  - The full pre-existing TestSSE* / TestCrossSSE / TestGitHub7562 /
    TestCopyToBucketDefaultEncryptedRegression / TestSSEMultipart suite
    still passes.
  - go test -race ./weed/s3api/ passes.

Refs #8908, #9280.

* fix(s3api): SSE-KMS copy ChunkOffset must stay 0 (review feedback on #9282)

CreateSSEKMSEncryptedReaderWithBucketKey initialises a fresh CTR stream
at counter 0 with a per-chunk random IV — there is no base-IV-plus-offset
relationship. The previous commit on this branch wrote
`destSSEKey.ChunkOffset = chunk.Offset` onto the per-chunk metadata,
which the read-side CreateSSEKMSDecryptedReader applies as
calculateIVWithOffset(IV, ChunkOffset) — i.e. it advances the
decryption IV by chunk.Offset/16 blocks beyond where the encryption
actually wrote.

The bug only manifests for SSE-KMS-to-SSE-KMS-with-different-key copies
of multipart sources (where source chunks live at non-zero offsets),
which is why the existing TestSSEKMSObjectCopyIntegration (single-chunk
source) and TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object (same-key
copy that takes the fast preserving path, not the re-encrypt path)
both happened to pass. Set ChunkOffset to 0 to match the actual
encryption position. Existing tests still pass; the dangerous case is
only reachable with a multipart SSE-KMS source and a different
destination key, which is not currently exercised in CI.

Found by gemini-code-assist review on PR #9282.

* fix(s3api): use first dst chunk's full key for entry-level SSE-KMS metadata in remaining copy paths (review feedback on #9282)

Earlier this branch fixed copyChunksWithSSEKMSReencryption to populate
the entry-level SeaweedFSSSEKMSKey from the first destination chunk's
fully-formed metadata (with EDK + IV) instead of a stub key with only
KeyID + EncryptionContext + BucketKeyEnabled. The same fix needs to
apply to the other two paths that build entry-level SSE-KMS metadata:

  - copyMultipartCrossEncryption() — cross-encryption to SSE-KMS dest.
    Per-chunk metadata comes from copyCrossEncryptionChunk's
    CreateSSEKMSEncryptedReaderWithBucketKey call, so chunks[0] has
    a real EDK + IV. Use it.

  - copyChunksWithSSEKMS() direct (same-key) branch. After
    createDestinationChunkPreservingSSE in copySingleChunk, dst chunks
    carry the source's per-chunk SSE-KMS metadata. Use chunks[0] for
    the entry-level key so single-chunk same-key copies don't fall
    back to a stub key on the read path.

Without this, single-chunk SSE-KMS reads through these two paths
failed at GET with "Invalid ciphertext format" — KMS unwrap was
called on an empty EDK.

Found by coderabbitai review on PR #9282.

* fix(s3api): add 0-byte fallback to SSE-KMS reencryption entry-level metadata (review feedback on #9282)

copyChunksWithSSEKMSReencryption was missing the fallback for 0-byte
objects (where dstChunks is empty), inconsistent with the fallback in
copyChunksWithSSEKMS direct branch and copyMultipartCrossEncryption.
Without it, a 0-byte SSE-KMS copy would land with no entry-level
SeaweedFSSSEKMSKey, so the read path's IsSSEKMSEncryptedInternal check
would not recognise the empty object as SSE-KMS.

Mirror the existing fallback: build a stub SSEKMSKey with KeyID,
context and bucket-key state; serialize it as the entry-level key.

Found by gemini-code-assist review on PR #9282.

* fix(s3api): SSE-KMS direct copy must check encryption context + bucket-key, not just key ID (review feedback on #9282)

DetermineSSEKMSCopyStrategy / CanDirectCopySSEKMS only compares the
source and destination KMS key IDs, but the destination request can
also change the encryption context or the BucketKey flag. Both are
embedded in the source ciphertext's wrapped EDK; preserving the source
metadata verbatim does not satisfy a destination request that asks
for different settings, so the destination object would silently
report the source's context/flag instead of what was requested.

Add srcSSEKMSStateMatchesDest: deserialize the source's stored
SSEKMSKey and compare its EncryptionContext + BucketKeyEnabled to the
destination request. If either differs, force the slow re-encrypt
path (SSEKMSCopyStrategyDecryptEncrypt) so the destination gets a
freshly-wrapped EDK bound to the requested context/flag.

A malformed source key is treated as non-matching (conservative).
nil and empty encryption-context maps are treated as equal to avoid
spurious divergence when the request omits the context header.

Found by coderabbitai review on PR #9282.

* fix(s3api): copyMultipartSSEKMSChunk falls back to entry-level key + entry-level metadata uses first chunk's full key (review feedback on #9282)

Two related issues in copyMultipartSSEKMSChunks / copyMultipartSSEKMSChunk:

1. copyMultipartSSEKMSChunks built the destination's entry-level
   SeaweedFSSSEKMSKey from a stub (KeyID + context + bucket-key only),
   missing the EDK + IV. Single-chunk reads through this path fall
   back to entry-level keyData and would fail at GET because KMS would
   be asked to unwrap an empty EDK. Mirrors the fix in
   copyChunksWithSSEKMS / copyMultipartCrossEncryption /
   copyChunksWithSSEKMSReencryption: prefer the first dst chunk's
   full per-chunk metadata, fall back to the stub only for 0-byte
   objects.

2. copyMultipartSSEKMSChunk hard-failed when chunk.GetSseMetadata()
   was empty. Newer multipart SSE-KMS uploads populate per-chunk
   metadata, but legacy objects may have only entry-level metadata
   and would now be impossible to copy. Add a sourceEntrySSEKey
   fallback parameter (deserialized once by the caller from
   entry.Extended[SeaweedFSSSEKMSKey]); use it when per-chunk
   metadata is absent.

Found by coderabbitai review on PR #9282.

* refactor(s3api): extract entry-level SSE-KMS deserialization and per-chunk fallback into helpers (review feedback on #9282)

Three medium-priority maintainability comments from gemini-code-assist:

  - The same "deserialize entry.Extended[SeaweedFSSSEKMSKey]" pattern
    appeared in srcSSEKMSStateMatchesDest, copyMultipartSSEKMSChunks
    and copyChunksWithSSEKMSReencryption.

  - The "prefer per-chunk metadata, fall back to entry-level key"
    selection logic appeared inline in copyMultipartSSEKMSChunk and
    copyChunkWithSSEKMSReencryption with subtly different shapes.

  - encryptionContextEqual hand-rolled a map comparison.

Pull both patterns out into named helpers:

  - deserializeEntrySSEKMSKey: returns the entry-level SSEKMSKey or nil
    on missing/malformed data, with a single V(2) log line.
  - resolveChunkSSEKMSKey: centralises the chunk-vs-entry-level
    selection so all sites use the same decryption-side selection
    logic (which must mirror the encryption side).

Replace encryptionContextEqual's manual loop with reflect.DeepEqual,
keeping the empty-vs-nil shortcut at the top because DeepEqual treats
those as different.

No behaviour change; existing copy tests still pass.
2026-04-29 10:06:51 -07:00
Chris Lu
82cf60a44f fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) (#9280)
* fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908)

The remaining failure mode in #8908 was that Docker Registry's blob
finalization (server-side Move via UploadPartCopy) silently corrupts
SSE-S3 multipart objects. Reproduces with `aws s3api upload-part-copy`
under bucket-default SSE-S3: the GET on the completed object returns
deterministic wrong bytes (correct length, same wrong SHA-256 across
runs). The metadata is mathematically self-consistent — every chunk's
stored IV equals `calculateIVWithOffset(baseIV_dst, partLocalOffset)` —
but the bytes on disk were encrypted with the SOURCE upload's key+baseIV.

Root cause:

  - `copyChunksForRange` (and `createDestinationChunk`) constructs new
    chunks for UploadPartCopy without copying `SseType` / `SseMetadata`,
    so destination chunks are written with `SseType=NONE`.

  - At completion, `completedMultipartChunk` (PR #9224's NONE→SSE_S3
    backfill, intended to recover from a different missing-metadata
    bug) sees those NONE chunks under an SSE-S3 multipart upload and
    backfills SSE-S3 metadata derived from the destination upload's
    baseIV. The chunk metadata is now internally consistent and the
    GET path applies decryption — but the bytes on disk are encrypted
    with the source upload's key, not the destination's. Decryption
    produces deterministic garbage. Docker Registry pulls then fail
    with "Digest did not match".

Fix: when either the source object or the destination multipart upload
has any SSE configured, take a slow-path UploadPartCopy that
(1) opens a plaintext reader of the source range — decrypting the
source's per-chunk SSE-S3 metadata if needed via a reused
`buildMultipartSSES3Reader`, and
(2) feeds that plaintext through `putToFiler`'s existing encryption
pipeline by staging the destination upload entry's SSE-S3/SSE-KMS
headers on a cloned request. Encryption then matches PutObjectPart's
contract: every part starts a fresh CTR stream from counter 0 with
`baseIV_dst`, and each internal chunk's metadata records
`calculateIVWithOffset(baseIV_dst, chunk.partLocalOffset)`.

The `non-SSE → non-SSE` case still takes the existing fast raw-byte
copy path — bytes on disk are plaintext on both sides, so chunk-level
metadata is irrelevant.

Cross-encryption from SSE-KMS / SSE-C sources is left as TODO — the
new path returns an explicit error rather than the previous silent
corruption. SSE-S3 (the user-reported case) round-trips correctly.

Tests:

  - test/s3/sse/s3_sse_uploadpartcopy_integration_test.go pins three
    UploadPartCopy shapes against bucket-default SSE-S3:
    * Docker-Registry-shape 32MB+tail (the user's exact 5-chunk /
      2-part metadata layout)
    * single full-object UploadPartCopy
    * many small range copies
    Each round-trips SHA-256.

  - test/s3/sse/s3_sse_concurrent_repro_test.go covers the parallel
    multipart-upload shape from the user report (5 blobs in parallel,
    full GET and chunked range GET both hash-checked) — pre-existing
    coverage; added here as a regression sentinel.

* test(s3-sse): rename UploadPartCopy regression test so CI matches it

The CI workflow .github/workflows/s3-sse-tests.yml dispatches on the
TEST_PATTERN ".*Multipart.*Integration" — i.e. the test name must contain
both "Multipart" and "Integration" for CI to run it. The previous name
TestSSES3UploadPartCopyIntegration had only "Integration"; "UploadPart"
isn't "Multipart". Rename to TestSSES3MultipartUploadPartCopyIntegration
so the regression test actually runs in CI rather than only locally.

* fix(s3api): map unsupported UploadPartCopy SSE source to 501, not 500 (review feedback on #9280)

openSourcePlaintextReader explicitly rejects SSE-KMS and SSE-C sources
(SSE-S3 is the only one wired up in this slow path so far). Earlier
the caller blanket-mapped that to ErrInternalError, which collapses
"this shape isn't implemented yet" into the same 500 response a
real server failure would produce. Clients can no longer tell whether
they hit a feature gap or a bug.

Introduce a sentinel errCopySourceSSEUnsupported and have
copyObjectPartViaReencryption errors.Is-check it; on match, return
ErrNotImplemented (501) instead of ErrInternalError (500). Other
failures still map to 500.

Found by coderabbitai review on PR #9280.

* fix(s3api): UploadPartCopy must fail with NoSuchUpload when upload entry is missing (review feedback on #9280)

CopyObjectPartHandler's earlier checkUploadId call only verifies that
the uploadID's hash prefix matches dstObject; it does not prove the
upload directory exists in the filer. The previous logic silently
swallowed filer_pb.ErrNotFound from getEntry(uploadDir) and fell
through with uploadEntry=nil, which then skipped the destination
SSE check and could route a plain-source copy through the raw-byte
fast path even though the destination's encryption state is unknown.

Treat ErrNotFound as ErrNoSuchUpload so the client sees the right
status, matching the AWS S3 contract for UploadPartCopy on a
non-existent upload.

Found by coderabbitai review on PR #9280.

* feat(s3api): set SSE response headers on UploadPartCopy slow path (review feedback on #9280)

PutObjectPartHandler writes x-amz-server-side-encryption (and the KMS
key-id header for SSE-KMS) on every successful part response so clients
can confirm the destination's encryption state. The new UploadPartCopy
slow path was missing this — it returned only the ETag in the response
body and no SSE response headers.

Plumb putToFiler's SSEResponseMetadata back through
copyObjectPartViaReencryption to the handler, then call
setSSEResponseHeaders before writing the XML response, matching the
PutObjectPart contract.

Found by gemini-code-assist review on PR #9280.

* fix(s3api): map transient filer errors on UploadPartCopy upload-entry fetch to 503 (review feedback on #9280)

Earlier non-ErrNotFound errors from getEntry(uploadDir, uploadID) all
returned 500 InternalError, which most SDKs treat as fatal — even
though a transient filer outage (gRPC Unavailable, leader election in
flight, deadline exceeded) is exactly the kind of failure SDK retry
logic is supposed to recover from.

Add an isTransientFilerError helper that recognises:
  - context.DeadlineExceeded / context.Canceled
  - gRPC codes.Unavailable, DeadlineExceeded, ResourceExhausted, Aborted

When the upload-entry fetch fails for one of those reasons, return
503 ServiceUnavailable so the client retries; everything else still
maps to 500. Log line now also carries dstObject (in addition to
dstBucket and uploadID) to make incident triage easier.

Found by gemini-code-assist review on PR #9280.
2026-04-29 09:46:44 -07:00
Chris Lu
e82789ea4b rust(volume): strip grpc-port suffix from master URL before HTTP lookup (#9276)
* rust(volume): strip grpc-port suffix from master URL before HTTP lookup

The volume server stores `master_url` in SeaweedFS's canonical
`host:httpPort.grpcPort` form (e.g. `node-a:5300.5310`). When
`lookup_volume` builds the master `/dir/lookup` URL, the appended
gRPC port turns the URL into `http://node-a:5300.5310/...`, which
reqwest rejects with "builder error". Every replicated write,
batch-delete lookup, and proxy/redirect read then fails.

Mirror Go's `pb.ServerAddress.ToHttpAddress()` with a new
`to_http_address` helper and apply it inside `lookup_volume`, the
single funnel for all three HTTP master lookups in the Rust volume
server. Other consumers of `master_url` already go through gRPC and
use `to_grpc_address` / `parse_grpc_address`.

Includes a regression test that mocks a master HTTP server and calls
`lookup_volume` with a `host:port.grpcPort` master URL — without the
fix it reproduces the exact "lookup request failed: builder error"
from issue #9274.

Fixes #9274

* rust(volume): only strip dotted suffix from address when both ports are numeric

Previously `to_http_address` rewrote any `host:foo.bar` to `host:foo`,
which would silently drop the suffix on malformed config (e.g. a
hostname like `host:abc.def` or `host:99999.19333`). Validate that
both halves of the dotted suffix parse as `u16` before stripping —
mirrors the validation that `to_grpc_address` already does in the
inverse direction. Also slice the input directly instead of going
through `format!`, since the result is just a prefix of `addr`.

Adds a test that asserts non-numeric / out-of-range dotted suffixes
are preserved unchanged.

* rust(volume): strip grpc-port suffix from peer URLs in replicate / proxy / redirect

In normal operation the master returns `Location.url` as plain
`host:port` and the gRPC port arrives in a separate field. But the
volume server already has defensive logic (`grpc_address_for_location`)
for the `host:httpPort.grpcPort` form on those same URLs, which
implies a code path where peer URLs do carry the suffix.

Apply `to_http_address` to `loc.url` / `target.url` before building
HTTP URLs in `do_replicated_request`, `proxy_request`, and
`redirect_request` to keep replicate-write, proxy-read, and redirect
paths from hitting the same `lookup request failed: builder error`
mode that #9274 documented for master lookups.

Adds a unit test exercising `redirect_request` with a `.grpcPort`
suffix on `target.url`.

* rust(volume): return Cow<str> from to_http_address to skip allocation on the no-suffix path

Most addresses pass through `to_http_address` unchanged (master and
peer URLs are normally plain `host:port`), so the previous String
return type allocated on every call for nothing. Switch to `Cow<str>`:
the common pass-through borrows from the input, and only the rewrite
branch allocates. Call sites use the result via `format!`/`Display`,
which both work transparently with `Cow<str>`.

Adds a test asserting the variant is Borrowed in the no-rewrite cases
and Owned only when the suffix is stripped.

* rust(volume): cover bracketed IPv6 literals in to_http_address tests

The current implementation already handles IPv6 correctly because
`rfind(':')` lands on the colon after the closing bracket, leaving
the dotted suffix logic unchanged. Add explicit test coverage so the
behavior is locked down for dual-stack deployments — both the strip
case (`[::1]:9333.19333` -> `[::1]:9333`) and the various passthrough
cases (no suffix, non-numeric, out-of-range, trailing colon).

* rust(volume): parse both ports in one tuple pattern match in to_http_address

Combine the two `is_ok()` checks into a single `if let (Ok(_), Ok(_))`
tuple match — equivalent semantics, slightly tighter expression of
intent. No behavior change.
2026-04-29 00:51:10 -07:00
Chris Lu
7a461ffc2f fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9278)
fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9275)

SetXAttr stored the caller-supplied `data` slice directly into
entry.Extended. That slice aliases go-fuse's per-request input buffer,
which is returned to a pool the moment the handler returns. When a file
is open during setxattr (the open-fh path defers persistence to flush),
the next FUSE request recycles the buffer and silently overwrites the
stored xattr bytes; flushMetadataToFiler then ships the corrupted bytes
to the filer. `cp -a` reproduces this because it issues a setxattr while
holding an open fh, then continues to issue follow-up FUSE ops that
reuse the same buffer.

The path-based setxattr (e.g. setfattr without an open fh) saves
synchronously inside the same handler, so the bytes were marshalled
before the buffer could be reused — that is why the source file in the
report looked fine and only the cp -a destination was garbage.

Defensively copy the bytes when storing them, and add a unit test that
mutates the caller buffer after SetXAttr returns to lock in the
invariant.
2026-04-28 23:54:35 -07:00
qzhello
108e42fb8b chore(shell): fix typo in EC shard helper param names (#9277)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names
2026-04-28 23:09:26 -07:00
Chris Lu
02574314f6 test(s3): force-drop collection after deleteBucket in tagging/versioning/cors/copying (#9270)
* test(s3): force-drop collection after deleteBucket across tagging/versioning/cors/copying

Each test creates a unique bucket (= new SeaweedFS collection) and the master's
warm-create issues a 7-volume grow batch. The S3 DeleteBucket-driven collection
sweep snapshots the layout once, but in-flight `volume_grow` requests keep
registering volumes after the snapshot, leaking 1-3 volumes per bucket. On a
single `weed mini` data node with the auto-derived volume cap, those leaks pile
up fast and every subsequent PutObject 500s with "Not enough data nodes found".

Mirror the retention-suite fix (commits ac3a756d, 363d5caa) into the four other
suites that share the same shape: defer a master-side /col/delete after each
bucket teardown, and sweep stale prefix-matching buckets before every new
createBucket. Each suite gets its own MASTER_ENDPOINT default plus the
allTestBucketPrefixes / cleanupAllTestBuckets / cleanupLeftoverTestBuckets /
forceDeleteCollection helpers.

Skipped: iam (separate framework + v1 SDK; already passes
-master.volumeSizeLimitMB=100), sse (different cleanup signature; docker-compose
already passes -volumeSizeLimitMB=50), policy (its TestCluster already uses
-master.volumeSizeLimitMB=32 and tears the whole cluster down). Suites under 5
tests cannot exhaust the cap and were left untouched.

* test(s3): default master endpoint to 127.0.0.1 to avoid IPv6 resolution

Mirror the explicit IPv4 default already used by test/s3/copying. On hosts
where `localhost` resolves to ::1 first the master HTTP listener (bound on
0.0.0.0) is unreachable, so the new force-delete-collection helper would silently
skip cleanup. Pin the default to 127.0.0.1 in tagging/cors/versioning; the
MASTER_ENDPOINT env var still wins when set.

* test(s3): skip buckets newer than this process in leftover-bucket sweep

cleanupAllTestBuckets/cleanupTestBuckets used to delete every bucket whose name
matched the suite's prefix. With shared S3/master endpoints, a second concurrent
`go test` run could have its live buckets torn down mid-test by the first run's
sweep.

Capture testRunStart at package init (with a 1-minute backdate for clock skew)
and skip any bucket whose CreationDate is newer than that. Stale buckets from
panicked or interrupted prior runs (the original target of the sweep) still get
collected because they were created before this process started.

* test(s3-copying): sweep stale prefix buckets from createBucket, not just from a few callers

The s3-copying suite already had cleanupTestBuckets that walks every bucket and
drops the test-copying-* prefix matches, but only four tests in the file invoke
it; the rest go straight to createBucket. So a panicked or interrupted prior run
could leak buckets that survive into the next run and exhaust the data node's
volume slots before any of the prefix-sweeping tests get a chance to run.

Hoist the sweep into createBucket so every test that creates a bucket starts on
a clean slate. The per-process CreationDate filter from the prior commit keeps
this safe under concurrent runs.

* test(s3): scope leftover-bucket sweep to this run via runID marker

The CreationDate filter only protected against runs that started later than this
process. A different `go test` against the same S3/master endpoints that started
*earlier* and is still active has buckets older than testRunStart, so the sweep
would still tear them down mid-test.

Replace the time window with a per-process runID baked into bucket names: every
bucket this run creates gets `r{runID}-` after the suite's BucketPrefix, and the
sweep only matches that owned subset. Concurrent runs each carry their own
runID and never see each other's buckets.

Trade-off: buckets left behind by a crashed prior run carry a different runID
and won't be cleaned up by this sweep. That recovery path now belongs to
`make clean` / data-dir wipe, which is what CI already does between jobs.
Fixed-name versioning buckets (e.g. test-versioning-directories) bypass
getNewBucketName and so also bypass this sweep — they are short-lived and
handled by their own deferred deleteBucket.

* test(s3): drop cleanupLeftoverTestBuckets wrapper, call cleanupAllTestBuckets directly

cleanupLeftoverTestBuckets was a one-line forwarder kept only as a semantic name
at the createBucket call site. Inline the call and update the doc comments to
point at the actual implementation. tagging/cors/versioning all had the wrapper;
copying never did.
2026-04-28 22:13:42 -07:00
Chris Lu
d9b86fb495 fix(s3api): clear stale latest-version pointer when .versions dir cleanup is blocked (#9269)
* fix(s3api): clear stale latest-version pointer when post-deletion cleanup is blocked

In versioned buckets, when the only remaining children of a .versions
directory are orphan entries (v_<id> files that lack the version-id
extended attribute - e.g. left behind by older code paths or interrupted
writes), updateLatestVersionAfterDeletion's selectLatestVersion finds
zero promotable versions and falls through to s3a.rm. The non-recursive
rm fails because the orphan blocks the directory deletion. Previously
the code logged the failure and returned, leaving the LatestVersionId
pointer pointing at the version we just deleted.

For Veeam-style workloads that GET-PUT-GET-DELETE a small lock object on
every checkpoint, the stale pointer poisons every subsequent run: the
next GET re-enters getLatestObjectVersion's 13x retry loop on the
missing pointer file plus the self-heal rescan, all to return the same
404. The cycle is self-perpetuating until the orphan is removed by hand.

When rm fails, additionally clear the LatestVersionId / LatestVersionFile
pointer fields on the .versions directory entry. The orphan files stay
in place (an operator can audit and remove them); from the S3 API
perspective the object is now correctly absent and subsequent reads
short-circuit to ErrNotFound on the fast path instead of replaying the
heal cycle.

* fix(s3api): clear stale latest-version pointer on read-side self-heal failure

healStaleLatestVersionPointer is invoked by getLatestObjectVersion when the
pointed-at version file is missing. The rescan path can find no remaining
version-id-tagged entries (e.g. when only orphan v_<id> files lacking the
version-id extended attribute remain). Prior code returned ErrNotFound but
left the stale pointer in place, so every subsequent read replayed the same
13x retry loop on the missing file and re-entered self-heal, all to return
the same 404.

Reuse the same pointer-clear logic introduced for updateLatestVersionAfterDeletion.
The two call sites are now identical, so factor the body out into
clearStaleLatestVersionPointer. The caller parameter carries the source
function name so the log lines operators were already grepping
(updateLatestVersionAfterDeletion: cleared stale ... and
healStaleLatestVersionPointer: cleared stale ...) keep the same prefix.

* fix(s3api): re-validate before clearing latest-version pointer (CAS)

Reviewer feedback (gemini-code-assist, coderabbitai) on PR #9269 flagged
a TOCTOU in clearStaleLatestVersionPointer: between the caller loading
versionsEntry and this function persisting the cleared map, a concurrent
PUT could promote a fresh version. Persisting the caller's snapshot then
silently overwrites the live pointer and re-introduces the missing-pointer
state on a now-existing object.

Make the persist CAS-style:

  1. Re-scan .versions for any version-id-tagged entry. If one now exists,
     abort - the concurrent writer has populated the directory and either
     already updated the pointer or the next read's self-heal will pick
     up the new entry.
  2. Re-fetch the live .versions directory entry and only proceed if its
     latest-pointer fields still match the stale id the caller observed.
     A concurrent pointer update by another writer is detected here and
     the clear is skipped.
  3. Persist with mkFile using the live Extended map (minus the two
     pointer fields and the cached metadata) so any other Extended fields
     written concurrently between (2) and the persist are preserved.

A note on the literal suggestion of mutating updatedEntry.Extended in
the mkFile callback: that does not work because mkFile constructs a
fresh *filer_pb.Entry rather than reading the live entry first
(weed/pb/filer_pb/filer_client.go:247). The callback's updatedEntry is
nil at invocation, so a delete on it would be a no-op and we would lose
all Extended fields, not just the two we want to clear. The correct
shape - re-fetching the live entry before mkFile and carrying its
Extended map into the persist - is what this change implements.

True atomic CAS would require filer-level conditional update support;
this change narrows the race window from the full caller scope to the
~ms gap between (2) and (3), which is the best we can do without that.
2026-04-28 21:02:52 -07:00
chenshi
c93018d987 fix(s3api): fix uint16 overflow in doListFilerEntries Limit calculation (#9271) 2026-04-28 20:50:05 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Chris Lu
735e94f6ba mount: expose -fuse.maxBackground and -fuse.congestionThreshold flags (closes #9258) (#9268)
* mount: expose `-fuse.maxBackground` flag (closes #9258)

The Linux FUSE driver caps in-flight async requests via
`/sys/fs/fuse/connections/<id>/max_background` (and a derived
`congestion_threshold = 3/4 * max_background`). Heavy upload workloads
need this raised, but the cap currently lives only in `/sys`, so it
resets on reboot/remount.

`weed mount` was hardcoding `MaxBackground: 128`. Promote it to a flag,
default unchanged. Setting `-fuse.maxBackground=2048` reproduces the
manual `echo 2048 > .../max_background` (and gives 1536 for
congestion_threshold automatically) persistently across remounts.

`congestion_threshold` is not exposed as a separate flag because
go-fuse derives it as 3/4 of MaxBackground in InitOut and offers no
hook to override; users wanting a different ratio can still write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount.

* mount: add `-fuse.congestionThreshold` flag, bump go-fuse to v2.9.3

go-fuse v2.9.3 exposes CongestionThreshold as a separate MountOption,
so we can now let users override the kernel's default 3/4-of-max_background
ratio at mount time instead of having to write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount on every
remount/reboot.

Default 0 preserves existing behavior (kernel derives it as
3/4 * max_background). Non-zero is sent to the kernel verbatim; the
kernel clamps it to max_background if higher.
2026-04-28 13:42:58 -07:00
Chris Lu
08d59750ef rust(volume): export Prometheus metrics for scrubbing operations (#9266)
* rust(volume): export Prometheus metrics for scrubbing operations

Mirrors #9264 in the Rust volume server. Adds three metrics that match
the Go names so the same dashboards/alerts work against either binary:

  - SeaweedFS_volumeServer_scrub_last_time_seconds (gauge)
  - SeaweedFS_volumeServer_scrub_volume_failures   (counter)
  - SeaweedFS_volumeServer_scrub_shard_failures    (counter)

Metrics are aggregated at the volume / EC shard level, labelled by
VolumeScrubMode (UNKNOWN/INDEX/FULL/LOCAL) to match Go's
req.GetMode().String().

* rust(volume): record scrub metrics before post-scrub error check

Address PR feedback:
  - Move metric emission before the mark_broken_volumes_readonly error
    check so scrub failures are persisted even when the follow-up
    mark-readonly admin action fails (matches Go's volume_grpc_scrub.go).
  - Extract the duplicated metric block into emit_scrub_metrics() shared
    by both ScrubVolume and ScrubEcVolume. The shard-failures family
    stays untouched on regular volume scrubs to mirror Go.
2026-04-28 13:29:32 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results, and allowing to construct alerts
and dashboards to monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level, and not
intended for fine-grained tracking of scrubbing operations.
2026-04-28 12:34:02 -07:00
Chris Lu
294f7c3d04 shell: expand ~ in local file path arguments (#9265)
* shell: expand `~` in local file path arguments

The weed shell parses commands itself instead of going through an OS
shell, so a path like `~/Downloads/foo.meta` was passed verbatim to
`os.Open`, which fails because no `~` directory exists. Users had to
spell out absolute home paths in every command.

Add an `expandHomeDir` helper that resolves a leading `~` or `~/...` to
the user's home directory, and run user-supplied local file paths in
the affected shell commands through it:

  fs.meta.load          (positional file)
  fs.meta.save          (-o)
  fs.meta.changeVolumeId (-mapping)
  s3.iam.export         (-file)
  s3.iam.import         (-file)
  s3.policy             (-file)
  s3tables.bucket       (-file)
  s3tables.table        (-file, -metadata)
  volume.fsck           (-tempPath)

Filer-namespace path flags (`-dir`, `-path`, `-locationPrefix`, etc.)
are unaffected; they live in the filer, not on the local FS.

* shell: reuse util.ResolvePath instead of a new helper

util.ResolvePath already does tilde expansion; drop the local
expandHomeDir helper and route every shell call site through it.
2026-04-28 12:30:13 -07:00
Chris Lu
e2c8791441 fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter

The upstream willscott/go-nfs library dispatches RPC calls by (program,
procedure) only — it does not validate the program version. A client
sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same
handler map as NFSv3 and gets routed to v3 SETATTR, which parses the
COMPOUND args as SETATTR3args and writes a malformed reply. The kernel
then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version
or transport protocol is not supported" without retrying v3.

This commit adds a listener wrapper that peeks the first RPC frame on
each new TCP connection. If the program is NFS or MOUNT and the version
is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported
range 3..3, per RFC 5531) directly to the socket and closes the
connection. v3 frames are replayed unchanged via a bufio reader so go-nfs
sees the original bytes. Unknown programs pass through so go-nfs's own
PROG_UNAVAIL handling stays in charge.

The filter is not yet wired into the server; the next commit activates
it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and
unknown-program pass-through.

* fix(nfs): wire NFSv3 version filter into the listener chain

Place the version filter after the optional client allowlist so that
unauthorized peers are still rejected first by IP/CIDR before we look at
RPC content. With the filter active, a Linux client doing the default
v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which
lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3
mountOptions that already work for rclone serve nfs against this
deployment.

* test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1

TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending
mountProgramID with version 1, so the test never actually covered the
"reject MOUNT v4" path it claims to exercise. The filter does reject any
non-v3 version uniformly, so the test still passed, but a future change
that tightened the version check (for example, only rejecting v4) would
let this test silently lie about coverage. Bump the call to version 4 so
the name matches what is actually exercised.

* refactor(nfs): reuse package RPC constants and io.ReadFull in version filter

The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH,
AUTH_NONE, the NFS/MOUNT program numbers) are already named in
portmap.go alongside the portmap responder. Reuse them here instead of
defining a parallel set in rpc_version_filter.go: keeping one source of
truth per package means a future correction in one spot can't drift away
from the other. The filter-only constants (peek timeout, peek length,
supportedNFSVer) stay local because they have no portmap analog.

In the test, drop the bespoke readFull loop in favor of io.ReadFull.
The custom version was a near-identical reimplementation that did not
return io.ErrUnexpectedEOF on short reads, so the standard library is
both shorter and more diagnostic-friendly.

* fix(nfs): move RPC peek off the Accept path

The previous wrapper called filterFirstRPCFrame inline inside
versionFilterListener.Accept, which meant a single slow or idle TCP
connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line
blocking against every other accept: gonfs.Serve calls Accept serially,
so each in-flight peek stalled the next legitimate client until the
deadline expired. An attacker who simply opens a TCP connection without
sending any RPC payload could trivially throttle accept throughput.

Restructure the wrapper so a background goroutine drives the inner
Accept loop and hands each raw conn to its own short-lived goroutine
that runs the peek. Validated conns are sent on a buffered-once channel,
which the wrapper's Accept reads from; rejected conns finish their
PROG_MISMATCH reply and disappear without ever reaching the channel.
This means N concurrent slow clients only block themselves, not the
N+1th fast client that connects after them.

Add Close coordination — sync.WaitGroup for the accept loop and per-conn
peek goroutines, plus a closed channel so Accept unblocks immediately on
shutdown — so the wrapper now satisfies the full net.Listener contract
instead of relying on the embedded listener.

Add a regression test that opens a slow conn (TCP only, never writes)
and a fast conn (sends a v3 frame) and asserts the fast conn reaches
the inner accept handler well below the peek timeout.

* test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close

The post-rejection check was only failing when conn.Read succeeded; any
error — including a deadline timeout because the server kept the socket
open — let the test pass. That defeats the point of the assertion: a
regression where the filter replies but forgets to close would slip
through silently.

Match against io.EOF explicitly. The TCP semantics are deterministic
here: the server writes PROG_MISMATCH, calls conn.Close(), the client
reads what's left in flight and then sees a clean FIN, which surfaces
as io.EOF on the next zero-byte read.

* fix(nfs): reject short first fragments before parsing RPC header fields

bufio.Reader.Peek(28) is willing to read across record boundaries to
satisfy the requested length, so a final fragment whose body is shorter
than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog
+ vers + proc) leaves the trailing peek bytes pointing at the next
RPC's framing or whatever bytes happen to follow on the wire. Indexing
hdr[16:24] for prog/vers in that state can spuriously reject (or pass
through) traffic based on data that doesn't belong to the request being
classified.

Drop those frames out of the filter early: if the first fragment can't
possibly hold a full CALL header, pass the connection straight to
go-nfs, which has its own framing-error handling for malformed input.

Add a regression test that crafts a 12-byte first fragment whose
trailing peek bytes are deliberately shaped like an NFSv4 CALL — without
the length check the filter sends a PROG_MISMATCH; with it, the conn
passes through silently. Verified by stashing the production-code change
and running the test in isolation: it fails as expected without the fix.

* fix(nfs): retry transient Accept() errors instead of treating any error as terminal

acceptLoop previously exited on the first error returned by the inner
listener's Accept(). That conflates two very different failure modes:
permanent shutdown (the listener was Close()d, OS-level fatal failure)
and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on
accept). The transient case should not take the entire NFS server down
— a single fd-table-full event would leave the deployment offline until
restart.

Classify the error: errors.Is(err, net.ErrClosed) is the permanent
signal we already wanted to surface to Accept(); everything else is
transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff
(50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The
backoff sleep is interruptible via the closed channel so Close() still
shuts the loop down promptly.

Add a regression test that wraps a real listener with one that injects
3 fake transient errors before delegating, and asserts Accept() still
delivers the next real connection. Verified the test fails on the old
"any error is terminal" loop and passes with this change.

* fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic

The filter was rejecting any CALL-shaped record with prog=100003 or
100005 and vers!=3, regardless of the rpcvers field. If the caller is
speaking some other protocol that happens to share the port — or just
sending garbled bytes — pretending to be an NFSv3 server replying
PROG_MISMATCH is misleading at best, and at worst fabricates a coherent
RPC reply for traffic we don't actually understand.

Add an rpcvers==2 check between the msg_type and prog/vers parses. Any
non-v2 record now passes through to go-nfs, whose RFC 5531 §9
RPC_MISMATCH handling is the correct place to reject mis-versioned RPC.

Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers
field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the
client and that the conn is delivered to the inner accept handler.
Verified the test fails on the previous code (filter still rejected on
prog/vers alone) and passes with the guard in place.

* fix(nfs): bound Close() latency by evicting in-flight prefilter conns

Close() does wg.Wait() to drain handleConn goroutines, but each of those
goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up
to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC
header. A client that completes the TCP handshake but never sends a
byte therefore stretched shutdown by 10s per such conn — a real
regression for stop/restart paths and for tests that just want to tear
the listener down.

Track raw (pre-peek) conns in versionFilterListener.inFlight as
handleConn enters, untrack on exit, and have Close() forcibly close
every tracked conn before wg.Wait. Closing the underlying conn breaks
its Peek immediately, so handleConn returns within a single scheduler
hop. trackInFlight also short-circuits if shutdown has already started,
so a conn accepted after signalClose can't slip past the eviction.

Black-box regression test opens 4 idle TCP-handshake-only conns, lets
their handleConn goroutines settle into Peek, and asserts Close()
returns under 2s. Verified: same test fails on the previous code with
Close taking ~9.9s; passes here at ~100ms.
2026-04-28 12:17:54 -07:00
Chris Lu
0fa0a56a5a filer(mysql): TLS hostname/SNI knobs + MariaDB upsert documentation (#9260)
* refactor(filer/mysql): set tls.Config per-instance via Connector instead of global registry

Replace the use of `mysql.RegisterTLSConfig("mysql-tls", ...)` and the
`&tls=mysql-tls` DSN suffix with a per-instance setup that assigns the
`*tls.Config` directly to `mysql.Config.TLS` and opens the database via
`mysql.NewConnector` + `sql.OpenDB`.

The driver's TLS-config registry is process-wide; if a second `MysqlStore`
were ever initialized with different TLS settings (e.g., a filer plus a
separately configured store) the second registration would silently
overwrite the first. The connector pattern keeps the TLS configuration
attached to the connector and avoids that global side effect.

Behavior is otherwise unchanged: TLS is enabled when `enable_tls=true`,
the same `ca_crt`/`client_crt`/`client_key` knobs are honored, and the
TLS minimum version remains 1.2.

* filer(mysql): use system root CAs when ca_crt is empty

Previously, enabling `enable_tls=true` without setting `ca_crt` returned an
unhelpful empty-path read error. Many managed MySQL/MariaDB providers serve
certificates that chain to a public CA already in the host's trust store, so
requiring an explicit CA bundle adds friction with no security benefit.

Leave `RootCAs` unset when `ca_crt` is empty so Go's `tls.Config` falls back
to the system trust store, matching the standard behavior of `mysql --ssl`.
Existing setups with `ca_crt` configured are unaffected.

Also wraps the CA read/parse errors with the file path for easier diagnosis.

* filer(mysql): fail loudly when client_crt / client_key are unreadable

The previous implementation called `tls.LoadX509KeyPair` and silently
discarded any error, falling back to a non-mTLS connection. A typo or
permissions problem in `client_crt` / `client_key` therefore appeared as a
confusing server-side handshake error rather than as a config error,
because the server was expecting a client cert that the filer never sent.

Treat the keypair as required when either path is set, and surface the
underlying load error with both filenames so the misconfiguration is
obvious. The default (both paths empty) is unchanged: no client cert is
sent.

* filer(mysql): add tls_insecure_skip_verify and tls_server_name knobs

When the filer connects to a MySQL/MariaDB cluster whose server
certificate's SAN does not match the connection address (common with
internal load balancers, IP-only connection strings, or self-signed
cluster certs), the TLS handshake fails with `x509: certificate is valid
for X, not Y`. There was previously no way to fix this short of reissuing
the cert.

Expose two new optional knobs on `[mysql]`:

- `tls_server_name` overrides the SNI / cert hostname used for
  verification — the standard fix when the cert SAN is correct but the
  connection address is not.
- `tls_insecure_skip_verify` disables verification entirely as an escape
  hatch for testing or for clusters with no usable SAN.

Both default to off, so existing configurations continue to verify the
server certificate against the connection address as before.

* docs(scaffold/filer.toml): document mysql TLS knobs and MariaDB upsert override

- Document the new `tls_insecure_skip_verify` and `tls_server_name` options.
- Update the `ca_crt` comment to reflect that it is optional and that the
  system trust store is used when the path is empty (matches the runtime
  behavior in mysql_store.go).
- Reword the client cert comments to make the mTLS pairing requirement
  explicit (both `client_crt` and `client_key` must be set together).
- Add a commented-out MariaDB / MySQL 5.7 alternative for `upsertQuery`,
  noting that the default (`AS new` row alias) requires MySQL 8.0.19+.

* filer(mysql): drop redundant blank import of go-sql-driver/mysql

The package was imported twice: once with the `mysql` alias (used for
`mysql.MySQLError`, `mysql.Config`, `mysql.NewConnector`, etc.) and once
as `_` to register the driver. The named import already triggers
`init()` and registers the driver, so the blank import is dead weight.
2026-04-28 01:29:41 -07:00
Chris Lu
135af25b55 fix(grpc): require host match before routing dials to local Unix socket (#9254) (#9257)
* fix(grpc): require host match before routing dials to local Unix socket (#9254)

resolveLocalGrpcSocket keyed the Unix-socket hijack on port alone, so a
remote peer reusing a local gRPC port (e.g. a standalone `weed volume`
defaulting port.grpc=17334 against a `weed server` whose in-process
volume socket is also on 17334) had its inbound RPCs silently rerouted
to the local socket. In a cross-DC replication=100 cluster this surfaced
as persistent volume-grow failure: both AllocateVolume RPCs landed on
the local volume server, the second returned "Volume Id N already
exists!", the grow rolled back, and S3 PUTs to the new collection
returned 500.

Track a per-port set of host strings that count as "this machine" and
require the dial host to be in that set before redirecting. Loopback
aliases (localhost, 127.0.0.1, ::1, "") are always included so
same-process dials via loopback still take the socket fast path.

* test(grpc): cover empty-host bare-port dial in local-socket regression test

The empty alias is registered explicitly so SplitHostPort outputs like
":17334" (which can occur when a caller dials a bare port) take the
local-socket fast path. Add a case so that path is exercised.
2026-04-27 23:10:36 -07:00
Parviz Miriyev
e2f96687ff fix(admin): use protocol-relative URLs for component links so HTTPS clusters don't break clicks (#9256)
* fix(admin): use protocol-relative URLs for component links

Hardcoded http:// in admin UI templates breaks browser-initiated clicks
to master / volume / filer / EC shard / Iceberg REST URLs whenever the
target component runs HTTPS-only via security.toml [https.X] sections.
The browser sends plain HTTP to a TLS-only endpoint and gets 400
"client sent an HTTP request to an HTTPS server".

Same root pattern as #9227 (admin's own backend /dir/status fetch);
this PR is the browser-facing equivalent.

Replace fmt.Sprintf("http://%s...") with fmt.Sprintf("//%s...") and the
JS-string '<a href="http://' with '<a href="//' so the browser uses the
same scheme as the page hosting the link. Backwards compatible:
  - HTTPS-only deployments: links now work
  - HTTP-only deployments: identical behavior to before
  - Mixed: edge case, addressed by future per-component public-URL work

Affected templates (9 files), each kept in lockstep with its generated
_templ.go sibling so reviewers don't need to run templ generate:

- weed/admin/view/app/admin.templ
- weed/admin/view/app/cluster_filers.templ
- weed/admin/view/app/cluster_masters.templ (Go templ + JS modal)
- weed/admin/view/app/cluster_volume_servers.templ (Go templ + JS modal)
- weed/admin/view/app/cluster_volumes.templ
- weed/admin/view/app/ec_volume_details.templ
- weed/admin/view/app/volume_details.templ
- weed/admin/view/app/iceberg_catalog.templ
- weed/admin/view/app/s3tables_buckets.templ

17 link constructions total, +32/-32 lines.

* fix(admin): protocol-relative URLs in iceberg + s3tables JS overrides

Per Gemini code review on this PR: the JS scripts in iceberg_catalog
and s3tables_buckets templates overwrite the href attribute of the
"Open Iceberg REST" links after page load, replacing the
protocol-relative URL set by the templ render with a hardcoded
http://<host>:<port>/v1/config.

Apply the same protocol-relative fix to the JS template literals so
they don't undo the templ-side change. Browser uses the page scheme
(http or https) to fill in the protocol.

Mirrored in iceberg_catalog_templ.go and s3tables_buckets_templ.go.

* fix(admin): displayed Iceberg endpoint scheme follows page protocol

Per CodeRabbit review on this PR: the on-page guidance text in iceberg
and s3tables templates still showed a literal `http://` even after the
clickable link was switched to a protocol-relative URL. In HTTPS-only
deployments operators see `http://host:8181/v1` as the suggested
endpoint, copy it, and get a broken connection.

Wrap the scheme in <span id="iceberg-protocol"> (and the s3tables
counterpart) and have the existing inline script set its innerText to
window.location.protocol minus the trailing colon. Same pattern as the
existing dynamic host substitution. Mirrored in *_templ.go so reviewers
do not need templ generate.

SQL/JSON code-block examples (CREATE EXTERNAL TABLE ... ENDPOINT
'http://...', "uri": "http://..." ) are intentionally left as-is —
they are starter snippets users adapt to their environment, not
clickable or copy-paste-into-runtime values. Happy to follow up with
server-side scheme threading if requested.
2026-04-27 23:10:11 -07:00
Chris Lu
363d5caa85 test(s3-retention): purge stale buckets before each create to avoid volume exhaustion
The WORM suite creates one bucket per test, each backed by ~3 reserved
volumes on the data node. With ~30 tests and the default `weed mini`
volume cap, the data node runs out of slots midway through the run and
every PutObject after that fails with InternalError.

Hook a sweep of every test-prefix bucket into the create helpers so a
panicked or interrupted prior test cannot leak buckets into the next.
2026-04-27 22:14:20 -07:00
Chris Lu
e4a635a04d feat(docker): default CMD to mini -dir=/data for service-container use (#9255)
* feat(docker): default CMD to `mini -dir=/data` for service-container use

GitHub Actions service containers cannot pass arguments to the image
entrypoint, so `chrislusf/seaweedfs` is currently unusable as a service
because it requires a `weed` subcommand. Set a sensible default CMD so
the image starts a complete single-process cluster (master, volume,
filer, S3 on :8333, admin UI) out of the box, while still being
overridable by passing any other subcommand at `docker run` /
compose time.

Also add a `mini` case to entrypoint.sh so its logs go to stderr,
matching the existing master/volume/server cases.

Closes #9247

* fix(docker): make `isArgPassed` match `--flag` as well as `-flag`

The Go fla9 library accepts both `-flag` and `--flag` syntax, but
`isArgPassed` only matched the single-dash form. That meant a user
passing `--dir=/foo` to `weed mini` (or `--max=5` to `volume`,
`--volume.max=5` to `server`) would not suppress the entrypoint's
default, and the duplicate flag was silently appended to the command
line — relying on last-wins parsing for correctness. Match double-dash
explicitly so the override is detected for every case in the file.
2026-04-27 21:21:58 -07:00
Chris Lu
d92c5e057a test(iceberg): cross-engine regression coverage for deterministic table locations (#9074) (#9253)
* test(trino): regression for unique-table-location=false (#9074)

With #9246's namespace-location property, Trino's REST catalog can
resolve table locations even when the connector is configured with
`iceberg.unique-table-location=false`, and CREATE/CTAS lands at the
deterministic <namespace-location>/<tableName> path with no
UUID-suffixed sibling. Lock that in:

- writeTrinoConfig now parameterizes the unique-table-location flag via
  a withDeterministicTableLocation() option.
- setupTrinoTest forwards config options.
- TestTrinoDeterministicLocationCTAS exercises a fresh CREATE TABLE +
  CTAS with the flag flipped off and asserts the on-disk layout has
  no UUID-suffixed sibling under the namespace dir, proving each table
  occupies a single dir.

Refs #9074

* test(spark): regression for CTAS without explicit location (#9074)

iceberg-spark has no equivalent of Trino's unique-table-location flag —
its REST catalog interactions always produce the deterministic
<namespace-location>/<tableName> path. Without #9246's namespace-location
property, Spark cannot resolve a table location for a CREATE TABLE that
omits an explicit LOCATION clause; with it, the operation succeeds and
the table lands at the expected single-dir-per-table layout.

TestSparkDeterministicLocationCTAS walks the same scenario as the Trino
test: CREATE TABLE without LOCATION, INSERT, CTAS, SELECT count, then
asserts via S3 ListObjectsV2 that no UUID-suffixed sibling directory
appears under the namespace.

Refs #9074

* test(duckdb): read table at deterministic location via REST catalog (#9074)

DuckDB's iceberg extension is a read-only consumer in this flow — there
is no client-side UUID-suffixing toggle to test. The relevant question
post-#9246 is whether DuckDB can ATTACH the REST catalog and resolve a
table at a deterministic <bucket>/<namespace>/<tableName> path produced
by writers that don't suffix UUIDs (iceberg-spark, pyiceberg, Trino with
unique-table-location=false).

TestDuckDBDeterministicLocationRead creates a namespace + minimal table
via direct REST API calls (so no client-side UUID is added), confirms
the catalog returns a deterministic location URL, then runs DuckDB
through ATTACH ... TYPE 'ICEBERG' and DESCRIBE on the table. Asserting
DESCRIBE succeeds proves DuckDB walked the catalog → metadata → schema
chain against the deterministic on-disk path.

The test skips gracefully when the DuckDB image lacks the iceberg
extension or the ATTACH-iceberg syntax, so older base images don't fail
the suite.

Refs #9074
2026-04-27 20:14:48 -07:00
Chris Lu
fe50da4934 test(fuse): stream verify-phase diagnostics from writeback stress test
The 45m suite alarm fires on TestWritebackCacheStressSmallFiles with no
output from the test, since t.Logf is buffered until the test completes
and the alarm panic skips that flush. Add streaming stderr progress, an
explicit verify-phase budget that t.Fatalf's with a goroutine dump on
overrun, and per-retry/per-failure logging so the next hang shows which
file(s) the mount could not read back.
2026-04-27 19:44:26 -07:00
Chris Lu
9d6d068f41 feat(seaweed-volume): cross-disk EC shard reconciliation (#9212) (#9252)
* fix(seaweed-volume): fall back to idx dir when reading .vif

EcVolume::new and read_ec_shard_config only looked for .vif at the
data dir. With the cross-disk reconcile path (where shards live on
one disk and .ecx / .ecj / .vif live on a sibling disk —
seaweedfs/seaweedfs#9212 / #9244), this would either write a stub
.vif on the shard disk and lose the real EC config + dat_file_size
or fall back to default ratios despite a perfectly good .vif being
present elsewhere on the same volume server.

Add a small `locate_vif_path` helper that prefers the data dir and
falls back to the idx dir when it differs, and thread the data dir
+ idx dir pair through `read_ec_shard_config`. Three call sites in
grpc_server.rs (VolumeEcShardsGenerate, VolumeEcShardsRebuild, scrub)
updated; the scrub path passes the same dir for both args because
`find_ec_dir` is the only locator there.

* feat(seaweed-volume): primitives for cross-disk EC shard reconcile

Adds the three small helpers the reconcile pass needs:

- DiskLocation::mount_ec_shards_with_idx_dir — mounts shards on this
  disk while pointing the EcVolume at a sibling disk's idx dir for
  .ecx / .ecj / .vif. Mirrors loadEcShardsWithIdxDir in
  weed/storage/disk_location_ec.go. The existing mount_ec_shards is
  kept as a thin wrapper over it.

- EcVolume::has_shard — `pub` accessor over the internal Vec<Option>
  shard slot so the reconcile pass can skip shards that are already
  registered.

- pub(crate) re-exports of parse_collection_volume_id and
  parse_ec_shard_extension under names parse_collection_volume_id_pub
  and is_ec_shard_extension so the reconcile module can call them
  without re-implementing the parsers.

No behaviour change. Reconciliation logic in the next commit.

* feat(seaweed-volume): cross-disk EC shard reconciliation (#9212)

Closes the loader half of seaweedfs/seaweedfs#9212 on the Rust side,
mirroring the Go fix in seaweedfs/seaweedfs#9244. With the auto-load
in feat/rust-load-all-ec-shards-9212 in place, the only remaining gap
is shards that landed on a disk without their `.ecx` — for example
when ec.balance / ec.rebuild moved them onto a destination node's
second disk while leaving the index files on the disk that already
held the volume. Without this, those orphan shards stay invisible to
the master and ec.rebuild reports the volume as unrepairable.

After every DiskLocation has finished its per-disk EC scan, sweep the
store for shards that live on a disk without local index files and
load them by reaching across to a sibling disk's `.ecx` / `.ecj` /
`.vif`:

  - Store::reconcile_ec_shards_across_disks walks each disk for
    orphan `.ec??` files (present on disk, not yet registered to an
    EcVolume) and matches them against an `(collection, vid) ->
    EcxOwnerInfo` map of which disk owns each `.ecx`.
  - Each matched group is mounted on its physical disk's ec_volumes
    map (so heartbeat reporting carries the right disk_id per shard)
    via `mount_ec_shards_with_idx_dir`, pointing the EcVolume at the
    sibling's idx dir.
  - `index_ecx_owners` records the directory each `.ecx` was found in
    (IdxDirectory or Directory) so the loader doesn't ENOENT when the
    legacy "written before -dir.idx was set" layout puts `.ecx` in
    the data dir. This mirrors the PR #9244 review fix from
    @gemini-code-assist / @coderabbitai (see Go commit af57cc652).
  - True orphans (no `.ecx` anywhere on this server) log a warning
    and stay on disk untouched — operator can restore the index later.

Wired into Store::add_location and Store::load_new_volumes so a fresh
restart and any later disk additions both pick up cross-disk shards.

Tests cover all four behaviour shapes:
- shards on dir0 + .ecx on dir1 → reconciled to dir0's ec_volumes
- .ecx in owner's data dir (legacy layout) → reconciled correctly
- self-contained disks → reconcile is a no-op
- truly-orphan shards (no .ecx anywhere) → left on disk, logged

* fix(seaweed-volume): propagate EcVolume::new errors instead of unwrap

mount_ec_shards_with_idx_dir built the missing EcVolume inside an
entry().or_insert_with() closure, which can't return a Result — so
any EcVolume::new failure (e.g. .ecx open error, .ecj create error,
malformed .vif) panicked the volume server via unwrap(). The
constructor already returns Result<>, so propagate it as
VolumeError::Io instead.

Reported in PR #9252 review by @gemini-code-assist (high) and
@coderabbitai (critical).

* perf(seaweed-volume): use DirEntry::metadata in collect_orphan_ec_shards

Replaced the extra fs::metadata(&path) lookup with ent.metadata() so
we don't pay an additional stat syscall per directory entry beyond
what read_dir already returned. Drops the now-unused std::path::Path
import alongside.

Reported in PR #9252 review by @gemini-code-assist.

* fix(seaweed-volume): scrub uses EcVolume's real dir_idx for split-disk volumes

After cross-disk reconciliation an EcVolume can legitimately have
ecv.dir != ecv.dir_idx (shards on one disk, .ecx / .ecj / .vif on a
sibling). The scrub path collapsed both args to find_ec_dir's single
answer, so read_ec_shard_config fell back to the wrong .vif location
for exactly the split-disk layout this PR loads — skewing
shard-count detection and verification results.

Use ecv.dir / ecv.dir_idx directly so scrub reads the metadata from
where the volume's index files actually live.

Reported in PR #9252 review by @coderabbitai.

* feat(seaweed-volume): primitives for split-disk EC volume operations

Reconciliation can mount the same `vid` on multiple DiskLocations
with disjoint shard subsets. The existing first-match `find_ec_volume`
isn't enough for read/unmount/delete/decode paths that need to act on
a specific shard or aggregate across the whole volume — they have to
walk every location and find the right home for each shard.

Add the small Store-level lookup primitives Go's findEcShard /
CollectEcShards already provide:

- `Store::find_ec_shard_location(vid, shard_id)` — returns the index
  of the location that has `(vid, shard_id)` mounted, if any.
- `Store::find_ec_volume_with_shard(vid, shard_id)` — same idea but
  returns the EcVolume directly.
- `Store::collect_ec_shard_dirs(vid, max_shard_count)` — returns
  the EcVolume to use for metadata plus per-shard data dirs (None
  when the shard isn't mounted on any disk). Mirrors
  `Store.CollectEcShards` in `weed/storage/store_ec.go`.

And the EcVolume accessors callers need:

- `EcVolume::has_shard(shard_id)` — was already added for the cross-
  disk reconcile but is now a load-bearing primitive for placement
  decisions on a per-shard basis. Pulled into the dedicated commit.
- `EcVolume::ecx_actual_dir()` — exposes the directory the `.ecx`
  was actually opened from. The decoder needs it for the .ecx
  lookup when shards are split across data dirs and `.ecx` lives on
  a sibling idx dir.

Plus a small defensive change to `DiskLocation::unmount_ec_shards`:
only decrement the per-shard gauge for shards that were actually
mounted. Without this, the upcoming `Store::unmount_ec_shards`
fan-out to every location would underflow the metric whenever a
shard is requested for unmount on a sibling disk that doesn't have
it.

No behaviour change at the call sites yet — wiring follows in the
next commits.

* fix(seaweed-volume): unmount_ec_shards visits every location with the vid

Store::unmount_ec_shards and Store::unmount_ec_shard returned after
the first DiskLocation with the volume id, even if that location did
not contain the requested shard. With reconciled split-disk volumes
(shards 0/12 on disk 0, shard 1 on disk 1 — the issue #9212 layout
this PR loads), VolumeEcShardsUnmount for a later-disk shard became a
silent no-op and Store::delete_ec_shards could remove the shard file
while leaving an in-memory shard + open file handle stale on the
later location.

Walk all locations that have the EcVolume and ask each to unmount
whatever subset of `shard_ids` it actually has — the
`DiskLocation::unmount_ec_shards` defensive guard from the previous
commit makes the fan-out safe (no metric underflow when a sibling
disk is asked to unmount a shard it doesn't hold).

* fix(seaweed-volume): VolumeEcShardRead reads from the shard's home disk

VolumeEcShardRead resolved the EcVolume via first-match
`find_ec_volume(vid)` and then looked up the requested shard on that
single EcVolume. With reconciled split-disk volumes (the layout
seaweedfs/seaweedfs#9212 produces — shards 0/12 on disk 0, shard 1
on disk 1), a request for shard 1 hit disk 0 first and returned
"shard 1 not mounted" even though it was happily mounted on disk 1.

Switch to `find_ec_volume_with_shard(vid, shard_id)` so the lookup
walks every location and returns the EcVolume whose disk actually
holds the shard. The deleted-needle check still works because every
per-disk EcVolume for the same vid points at the same `.ecx` file
(post-reconcile, both disks open the same sealed index).

* fix(seaweed-volume): VolumeEcShardsToVolume aggregates shards across disks

VolumeEcShardsToVolume resolved a single EcVolume via
`find_ec_volume(vid)` and then checked `ec_vol.shards[i]` for each
data shard. With reconciled split-disk volumes that's the wrong
view: the first-match EcVolume only carries the shards on its disk,
so the presence check would either reject the request as
"missing shard" or — if shards happened to be on the first disk —
fall through to `write_dat_file_from_shards(&dir, ...)` which only
reads from the EcVolume's single dir.

Mirror Go's CollectEcShards by aggregating per-shard data dirs
across every location with the volume:

- Add `Store::collect_ec_shard_dirs` (in the previous primitives
  commit) returning the EcVolume to use for metadata + per-shard
  dir slots.
- Extend `find_dat_file_size` and `write_dat_file_from_shards` with
  `_with_dirs` variants that take the `.ec00` dir and per-shard
  dirs separately, so a decoded volume whose shards live on
  several disks can still be reconstructed. The original signatures
  delegate to the new ones with the same dir for all shards, so
  every existing caller keeps working unchanged.
- Rewire VolumeEcShardsToVolume through the helpers — presence
  check sees the union, dat_file_size reads `.ec00` from the right
  disk and `.ecx` from the EcVolume's actual idx dir, the decoder
  reads each shard from its own home dir.

* test(seaweed-volume): split-disk read / unmount / delete / collect

Five tests exercising the four behaviour shapes the PR #9252 review
flagged on multi-location EC volumes. Each builds the cross-disk
split layout from issue #9212 (shards 0 and 12 on disk 0, shard 1 +
.ecx on disk 1) via the new `build_split_disk_store` helper and
asserts:

- `find_ec_shard_location` / `find_ec_volume_with_shard` route to
  the disk that actually holds each shard (not first-match).
- `Store::unmount_ec_shards([1])` reaches disk 1 and removes shard 1
  while leaving disk 0's unrelated shards mounted (used to be a
  silent no-op).
- `Store::unmount_ec_shard(vid, 1)` ditto for the single-shard
  variant.
- `Store::delete_ec_shards` removes both the on-disk file and the
  in-memory mount on the right disk; previously deletion could
  remove the file while the in-memory shard with its open file
  handle survived on a different location.
- `collect_ec_shard_dirs` reports the right per-shard data dir for
  each location and `None` for unmounted shards.

* fix(seaweed-volume): retry same-disk legacy .ecx layout in reconcile

The unconditional `owner.location == loc_idx` skip missed the layout
where `idx_directory` is configured but the owner's `.ecx` / `.ecj` /
`.vif` still live in `loc.directory` (the legacy "written before
-dir.idx was set" shape). In that case the per-disk loader's
mount_ec_shards used `loc.idx_directory` and ENOENT'd, then this
branch suppressed the only recovery path — the owner disk's own
shards stayed unloaded after startup.

Tighten the skip so it only fires when the discovered owner dir is
already `loc.idx_directory` (the loader-already-tried-and-failed
case). When `owner.idx_dir` differs (legacy data-dir layout), queue
a same-disk retry through `mount_ec_shards_with_idx_dir(...,
&owner.idx_dir)` so reconcile becomes the recovery path.

Reported in PR #9252 review by @coderabbitai.

* fix(seaweed-volume): roll back partial mounts on cross-disk reconcile failure

mount_ec_shards_with_idx_dir adds shards one at a time and
increments the `ec_shards` gauge per shard that successfully attaches.
A mid-loop failure (e.g. an EcVolumeShard::open error after the
first few shards already attached) used to leave the EcVolume
half-mounted with stale metric increments — the warn!() branch only
logged the error.

Mirror DiskLocation::handle_found_ecx_file's recovery path: drive
the cleanup through `loc.unmount_ec_shards(vid, &shard_ids)` after
a failed mount. The defensive change in #9251 makes
unmount_ec_shards only decrement the gauge for shards that were
actually mounted and drops the EcVolume when it reaches zero
shards, so the rollback is safe even though some of `shard_ids`
never attached.

Reported in PR #9252 review by @coderabbitai.

* test(seaweed-volume): cover the two reconcile fixes from PR #9252 review

Two new tests in store_ec_reconcile:

- test_reconcile_recovers_same_disk_legacy_ecx_layout — sets up the
  layout where idx_directory is configured but the owner's .ecx
  lives in loc.directory. The per-disk loader's mount_ec_shards
  uses loc.idx_directory and fails; reconcile should retry on the
  same disk with the owner's actual idx_dir and the owner's own
  shards must come back online.

- test_reconcile_rolls_back_partial_mounts_on_failure — sabotages
  one of the orphan shard files (replaces it with a directory of
  the same name) so EcVolumeShard::open errors out partway through
  mount_ec_shards_with_idx_dir. Asserts the post-condition that no
  EcVolume entry retains a "shard mounted" claim that doesn't
  correspond to a real shard file.
2026-04-27 19:01:30 -07:00
Chris Lu
49e83a26cb feat(seaweed-volume): auto-load EC shards on startup (#9212) (#9251)
* feat(seaweed-volume): auto-load EC shards on startup

The Rust volume server's load_existing_volumes only scanned .dat
files; EC shards on disk stayed invisible until something explicitly
issued VolumeEcShardsMount. Strict superset of the issue
seaweedfs/seaweedfs#9212 reports for Go: after a fresh restart, every
local EC shard was missing from the master's view.

Port loadAllEcShards from weed/storage/disk_location_ec.go:

- DiskLocation::load_all_ec_shards walks Directory (and IdxDirectory
  if separate) sorted, groups .ec?? shard files by (collection, vid),
  validates and mounts each group when its matching .ecx is found.
- handle_found_ecx_file: validate_ec_volume + mount_ec_shards path,
  with cleanup when .dat exists and validation fails (incomplete
  encoding) or load fails.
- check_orphaned_shards: cleans up shard remnants whose .ecx never
  arrived AND whose stale .dat is still present (interrupted
  encoding); leaves them on disk otherwise so cross-disk
  reconciliation / operator recovery can find them.
- check_dat_file_exists / parse_collection_volume_id /
  parse_ec_shard_extension: small helpers mirroring Go's checkDatFileExists,
  parseCollectionVolumeId, and the `\.ec\d{2,3}` regex.
- Wire through load_existing_volumes after the .dat scan; failures
  log but don't fail the disk's startup.

Tests:
- test_parse_ec_shard_extension covers .ec00–.ec255 and the rejection
  of .ec0, .ec999, .ecx, .ecj, .dat, and missing leading dot.
- test_load_all_ec_shards_mounts_pairs_with_ecx: shards + .ecx + .vif
  on disk get mounted into ec_volumes after load_existing_volumes.
- test_load_all_ec_shards_keeps_orphan_shards_when_no_dat: orphan
  shards (no .ecx, no .dat) stay on disk untouched
  (distributed-EC scenario).
- test_load_all_ec_shards_cleans_orphan_shards_when_dat_exists:
  orphan shards alongside a stale .dat get cleaned up
  (interrupted-encoding scenario).

Prerequisite for porting the cross-disk orphan-shard reconciliation
in seaweedfs/seaweedfs#9244 to Rust.

* fix(seaweed-volume): dedupe filenames when scanning data + idx dirs

load_all_ec_shards scans both `directory` and `idx_directory` (when
they differ) so the loop can pair `.ec??` shards with their `.ecx`
regardless of which dir owns the index. If the same filename is
present in both — possible in idempotent legacy layouts that
pre-date `-dir.idx` — the previous implementation processed it
twice. mount_ec_shards increments the per-shard `ec_shards` metric
inside the loop, so a duplicated `.ec??` entry would double-count
the gauge.

Use a HashSet<String> while accumulating entries so each filename
is processed exactly once.

Reported in PR #9251 review by @gemini-code-assist.

* fix(seaweed-volume): drive partial-mount cleanup through unmount_ec_shards

handle_found_ecx_file calls mount_ec_shards which adds shards one at
a time. mount_ec_shards increments the `ec_shards` gauge per shard
that successfully attaches. If mount fails halfway, plain
ec_volumes.remove(vid) drops the EcVolume but leaves the gauge
incremented for whatever did mount.

Drive the cleanup branches through unmount_ec_shards instead — it
mirror-decrements the gauge per shard and only then drops the
EcVolume. Same shape applied to both .dat-exists and distributed-EC
fallbacks.

Reported in PR #9251 review by @gemini-code-assist.

* docs(seaweed-volume): clarify parse_ec_shard_extension shard-id range

Doc previously said `.ec00`–`.ec999` but the implementation rejects
any shard id > 255 (matches the `EcVolumeShard` u8 typed shard id
and Go's `strconv.ParseInt(... 10, 64)` + `> 255` guard). Fix the
doc to say `.ec00`–`.ec255` and explain why the 3-digit form is
still recognised.

Reported in PR #9251 review by @coderabbitai.
2026-04-27 16:41:46 -07:00
Chris Lu
933ae6e386 fix(seaweed-volume): port EC shard placement fix to Rust (#9212, mirrors #9245) (#9250)
* feat(seaweed-volume): add DiskLocation::has_ecx_file_on_disk

Mirrors `DiskLocation.HasEcxFileOnDisk` from the Go side
(seaweedfs/seaweedfs#9245). Reports whether this disk has a sealed
.ecx index file for (collection, vid) by stat'ing the IdxDirectory
first, then falling back to Directory if different — covers the
legacy "written before -dir.idx was set" layout. Skips entries that
are directories so a stray dir named `<col>_<vid>.ecx` doesn't
register as a present index file.

Unlike has_ec_volume() this does not require the EC volume to be
mounted in memory, which makes it the right primitive for placement
decisions during ec.balance / ec.rebuild flows where shards may
arrive before any VolumeEcShardsMount has happened on the receiving
disk.

Wiring + tests in follow-up commits.

* feat(seaweed-volume): add Store::find_ec_shard_target_location

Mirrors `Store.FindEcShardTargetLocation` from the Go side
(seaweedfs/seaweedfs#9245). Single canonical placement primitive for
new EC shard / index files. Selection order:

  1. a disk that already has the EC volume mounted (in-memory),
  2. a disk that owns the .ecx file on disk (volume not yet mounted),
  3. any HDD with free space,
  4. any disk with free space.

Step 2 is the missing primitive that pinned subsequent shards to the
first-shard disk during ec.rebuild — rebuild only sets
CopyEcxFile=true on the first shard, then relies on auto-select to
land later shards on the same disk. Without an on-disk check
has_ec_volume returns false (no mount yet) and the fallback picked
"any HDD with free space," splitting shards from their .ecx across
disks of the same node and producing the orphan-shard layout
seaweedfs/seaweedfs#9212 reports.

Implementation walks store.locations once with tier scoring; the
highest-tier disk wins, ties broken by free count. The earlier
4-pass waterfall in find_free_location_predicate would have
re-acquired locks per pass.

ec_free_shard_count returns the free count in shard slots (not
volume-equivalent slots). The pre-existing find_free_location*
helpers divide by DATA_SHARDS_COUNT at the end; that truncation can
exclude a disk that has room for several individual shards
(MaxVolumeCount=1, EcShardCount=1, dsc=10 → reports 0 despite 9
free slots), which would re-route subsequent shards off the
.ecx-owning disk and re-introduce the orphan layout. Keep the result
in shard slots throughout. The unlimited-disk branch
(MaxVolumeCount==0) reports a synthetic large free count
decremented by current usage so unlimited disks stay eligible and
tie-breaks still prefer the less-loaded one.

data_shard_count is taken as a parameter rather than read from
DATA_SHARDS_COUNT so custom-ratio builds can swap the default
without touching this helper.

Tests cover: pinning to .ecx on disk, mounted-wins-over-stray-.ecx,
HDD fallback, MaxVolumeCount=0 unlimited handling, and the
tight-provisioning truncation case.

* fix(seaweed-volume): route EC shard auto-select through new helper

VolumeEcShardsCopy and the ReceiveFile EC branch both used a 3-tier
inline waterfall: in-memory has_ec_volume → any HDD → any disk. That
checked in-memory state only and missed disks that own the .ecx on
disk but haven't been mounted yet — the orphan-shard placement
hazard from seaweedfs/seaweedfs#9212.

Replace both with a single call to
Store::find_ec_shard_target_location, which adds the .ecx-on-disk
tier between mounted and HDD, and accounts for free space in shard
slots so tight-provisioning configurations don't incorrectly skip a
disk that still has room for individual shards.

Pass DATA_SHARDS_COUNT as the data-shard count for free-slot maths;
the helper takes it as a parameter so custom-ratio builds can swap
the default without touching this file.

* fix(seaweed-volume): grow UNLIMITED_FREE budget and saturate the math

ec_free_shard_count's unlimited branch (MaxVolumeCount=0) used to
clamp to a constant `1` once usage exceeded `1 << 30 ≈ 1e9` shard
slots. With several unlimited disks all past that threshold, every
placement decision among them tied at 1 — tie-break degraded to
"first eligible disk."

Bump the synthetic budget to `1 << 60 ≈ 1.15e18` and use
saturating arithmetic so even pathological usage never wraps i64.
Clamp the return value to `≥ 1` so the disk stays eligible for
placement at any load. Tie-breaks among unlimited disks now keep
preferring the less-loaded one across all realistic deployments.

Reported in PR #9250 review by @gemini-code-assist.
2026-04-27 16:40:39 -07:00
Chris Lu
f50917224a fix(iceberg): default namespace location so fresh CTAS does not race metadata write (#9074) (#9246)
* fix(iceberg): advertise default namespace location for REST clients

Trino's REST catalog has two code paths for CREATE TABLE depending on
whether a table location can be resolved before the catalog call:

    // TrinoRestCatalog.newCreateTableTransaction (Trino 479)
    if (location.isEmpty()) {
        return tableBuilder.create().newTransaction();   // EAGER: REST POST now
    }
    return tableBuilder.withLocation(...).createTransaction(); // DEFERRED

`tableLocation` resolves to null when the user does not pass
`location = '...'` AND `defaultTableLocation` returns null. The latter
happens whenever the namespace's `loadNamespaceMetadata` response has no
`location` property — and our handler returned exactly that.

In the eager branch Trino calls REST POST /v1/.../tables immediately, so
our handleCreateTable persists `<location>/metadata/v1.metadata.json` to
the filer. Trino's IcebergMetadata.beginCreateTable then runs
`fileSystem.listFiles(location).hasNext()` on the same path, finds the
metadata file we just wrote, and throws "Cannot create a table on a
non-empty location" — even on a brand-new schema and table name.

Synthesize a default `location` of `s3://<bucket>/<flattened-namespace>`
on Get/Create namespace responses when one is not stored. With a non-
null `defaultTableLocation`, Trino takes the deferred branch, picks a
unique `<namespace-location>/<table>-<UUID>` path (UUID added by the
standard `iceberg.unique-table-location=true` setting), and the empty-
location check passes. The actual REST POST is deferred to commit time,
so our metadata write lands alongside the data files Trino has already
produced.

Existing namespaces with an explicit `location` property are untouched —
the synthesis only kicks in when the property is absent.

Refs #9074

* test(trino): regression for fresh CREATE TABLE without explicit location

Exercises the follow-up scenario reported on issue #9074: a CREATE TABLE
on a brand-new schema and brand-new table name with NO `location = '...'`
clause, followed by a CTAS on top — exactly the SQL pattern from the
report. Before advertising a default namespace location, the first
CREATE TABLE failed with

    Cannot create a table on a non-empty location:
    s3://iceberg-tables/<schema>/<table>, set
    'iceberg.unique-table-location=true' in your Iceberg catalog
    properties to use unique table locations for every table.

even though the bucket was genuinely empty. The bug surfaced because our
handleCreateTable persists `<location>/metadata/v1.metadata.json` during
Trino's eager-create branch, and Trino's post-create listFiles
emptiness check then trips on the metadata file we just wrote.

The test asserts the CTAS succeeds AND the resulting table contains the
source rows, since the reporter saw the table get created with empty
data when the query failed.

Refs #9074
2026-04-27 16:37:33 -07:00