13396 Commits

Chris Lu
0bdf9b0683 4.19 2026-04-07 19:21:35 -07:00
Chris Lu
75dcb97187 filer: bootstrap pre-existing metadata when a new filer joins (#8979)
* filer: bootstrap pre-existing metadata when a new filer joins a cluster

When a filer connects to a peer for the first time (no stored sync
offset), it now does a full BFS traversal of the peer's metadata via
TraverseBfsMetadata before starting the incremental change stream.
This ensures filer2 sees all data that existed before it started,
fixing the issue where only post-startup changes were synced.

Closes #8961

* filer: upsert during bootstrap and persist offset immediately

- Use upsert (insert, then update on conflict) during metadata
  traversal so the bootstrap doesn't fail on the root directory
  or after a partial previous attempt.
- Persist the sync offset right after a successful traversal so
  a retry doesn't redo the full BFS.

* filer: address review feedback on metadata bootstrap

- Use peer-side max Mtime as the streaming cursor instead of local
  time.Now() to avoid missing events due to clock skew between filers.
  traversePeerMetadata now returns the high-water Mtime (nanoseconds)
  observed during BFS traversal.

- Compare Mtime before overwriting during bootstrap: if a local entry
  is newer than the peer's version, skip the update instead of
  clobbering it.

- Only trigger full BFS traversal on ErrKvNotFound (key genuinely
  missing). Transient KvGet errors (connection issues, etc.) are now
  propagated instead of silently falling through to a full re-sync.
  Changed readOffset to use %w so errors.Is works through the chain.

* filer: address review findings on bootstrap sync

- Use wall-clock time with safety margin for stream cursor instead of
  entry Mtime. Mtime is file modification time (can be arbitrary),
  while the metadata stream uses TsNs (event log time). Using
  time.Now() minus 1 minute before traversal ensures no events are
  missed even with clock skew, matching the proven filer.meta.backup
  pattern.

- Pass ExcludedPrefixes=[SystemLogDir] to TraverseBfsMetadata so
  the server prunes internal log entries server-side instead of
  transferring them over the network only to be filtered client-side.

- Fail fast if updateOffset fails after bootstrap. If we can't
  persist the offset, bail out rather than proceeding and potentially
  losing the expensive BFS work on the next retry.
2026-04-07 19:05:45 -07:00
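
A minimal sketch of the bootstrap decision described above, assuming hypothetical
stand-ins (readOffset, traversePeer, upsertEntry, persistOffset, ErrKvNotFound)
for the real filer helpers; imports: "errors", "fmt", "time":

    // bootstrapIfNeeded runs a one-time full traversal of a peer's metadata when
    // no sync offset is stored, then persists a wall-clock cursor taken before
    // the traversal (minus a safety margin) so the change stream replays anything
    // written during the BFS.
    func bootstrapIfNeeded(
        peer string,
        readOffset func(peer string) (int64, error),
        traversePeer func(peer string, visit func(entry any) error) error,
        upsertEntry func(entry any) error,
        persistOffset func(peer string, tsNs int64) error,
    ) (int64, error) {
        offset, err := readOffset(peer)
        switch {
        case errors.Is(err, ErrKvNotFound):
            // Never synced with this peer: take the cursor before traversal,
            // with a 1-minute margin against clock skew between filers.
            cursor := time.Now().Add(-time.Minute).UnixNano()
            if err := traversePeer(peer, upsertEntry); err != nil {
                return 0, fmt.Errorf("bootstrap traversal from %s: %w", peer, err)
            }
            // Fail fast: if the offset cannot be persisted, abort rather than
            // risk redoing the expensive BFS on the next retry.
            if err := persistOffset(peer, cursor); err != nil {
                return 0, fmt.Errorf("persist offset for %s: %w", peer, err)
            }
            return cursor, nil
        case err != nil:
            // Transient KvGet errors are propagated, not treated as "missing".
            return 0, err
        default:
            return offset, nil // resume the incremental stream from the stored offset
        }
    }
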
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this as the needle extends past the .dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
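
A simplified sketch of the .idx/.dat consistency walk described above (not the
worker code verbatim): it assumes the classic 16-byte index entry layout (8-byte
key, 4-byte offset in 8-byte units, 4-byte size, big-endian) and ignores the
per-needle header/checksum overhead and deleted-needle tombstones that the real
check must account for; imports: "encoding/binary", "fmt", "os":

    // verifyDatIdxConsistency fails if any index entry points past the end of
    // the .dat file, which would later surface as .ecx entries referencing
    // shard data that does not exist.
    func verifyDatIdxConsistency(idxPath string, datFileSize int64) error {
        idx, err := os.ReadFile(idxPath)
        if err != nil {
            return err
        }
        const entrySize = 16
        for i := 0; i+entrySize <= len(idx); i += entrySize {
            offset := int64(binary.BigEndian.Uint32(idx[i+8:i+12])) * 8
            size := int64(binary.BigEndian.Uint32(idx[i+12 : i+16]))
            if offset == 0 || size == 0 {
                continue // deleted or empty entry
            }
            if offset+size > datFileSize {
                return fmt.Errorf("idx entry %d: needle at %d with size %d extends past .dat size %d",
                    i/entrySize, offset, size, datFileSize)
            }
        }
        return nil
    }
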
Chris Lu
6098ef4bd3 fix(test): remove flaky shard ID assertion in EC scrub test (#8978)
* test: add integration tests for volume and EC volume scrubbing

Add scrub integration tests covering normal volumes (full data scrub,
corrupt .dat detection, mixed healthy/broken batches, missing volume
error) and EC volumes (INDEX/LOCAL modes on healthy volumes, corrupt
shard detection with broken shard info reporting, corrupt .ecx index,
auto-select, unsupported mode error).

Also adds framework helpers: CorruptDatFile, CorruptEcxFile,
CorruptEcShardFile for fault injection in scrub tests.

* fix: correct dat/ecx corruption helpers and ecx test setup

- CorruptDatFile: truncate .dat to superblock size instead of overwriting
  bytes (ensures scrub detects data file size mismatch)
- TestScrubEcVolumeIndexCorruptEcx: corrupt .ecx before mount so the
  corrupted size is loaded into memory (EC volumes cache ecx size at mount)

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 18:15:53 -07:00
Chris Lu
4bf6d195e4 test: add integration tests for volume and EC scrubbing (#8977)
* test: add integration tests for volume and EC volume scrubbing

Add scrub integration tests covering normal volumes (full data scrub,
corrupt .dat detection, mixed healthy/broken batches, missing volume
error) and EC volumes (INDEX/LOCAL modes on healthy volumes, corrupt
shard detection with broken shard info reporting, corrupt .ecx index,
auto-select, unsupported mode error).

Also adds framework helpers: CorruptDatFile, CorruptEcxFile,
CorruptEcShardFile for fault injection in scrub tests.

* fix: correct dat/ecx corruption helpers and ecx test setup

- CorruptDatFile: truncate .dat to superblock size instead of overwriting
  bytes (ensures scrub detects data file size mismatch)
- TestScrubEcVolumeIndexCorruptEcx: corrupt .ecx before mount so the
  corrupted size is loaded into memory (EC volumes cache ecx size at mount)
2026-04-07 16:31:32 -07:00
Chris Lu
74905c4b5d shell: s3.* commands always output JSON, connection messages to stderr (#8976)
* shell: s3.* commands output JSON, connection messages to stderr

All s3.user.* and s3.policy.attach|detach commands now output structured
JSON to stdout instead of human-readable text:

- s3.user.create: {"name","access_key"} (secret key to stderr only)
- s3.user.list: [{name,status,policies,keys}]
- s3.user.show: {name,status,source,account,policies,credentials,...}
- s3.user.delete: {"name"}
- s3.user.enable/disable: {"name","status"}
- s3.policy.attach/detach: {"policy","user"}

Connection startup messages (master/filer) moved to stderr so they
don't pollute structured output when piping.

Closes #8962 (partial — covers merged s3.user/policy commands).

* shell: fix secret leak, duplicate JSON output, and non-interactive prompt

- s3.user.create: only echo secret key to stderr when auto-generated,
  never echo caller-supplied secrets
- s3.user.enable/disable: fix duplicate JSON output — remove inner
  write in early-return path, keep single write site after gRPC call
- shell_liner: use bufio.Scanner when stdin is not a terminal instead
  of liner.Prompt, suppressing the "> " prompt in piped mode

* shell: check scanner error, idempotent enable output, history errors to stderr

- Check scanner.Err() after non-interactive input loop to surface read errors
- s3.user.enable: always emit JSON regardless of current state (idempotent)
- saveHistory: write error messages to stderr instead of stdout
2026-04-07 16:27:21 -07:00
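
A small sketch of the piped-input path mentioned above (hypothetical shape, not
the shell_liner code): when stdin is not a terminal, lines are read with
bufio.Scanner, no prompt is printed, and scanner errors are surfaced on stderr
so stdout stays clean for JSON; imports: "bufio", "fmt", "os":

    func runNonInteractive(process func(cmd string)) error {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            process(scanner.Text()) // no "> " prompt in piped mode
        }
        if err := scanner.Err(); err != nil {
            // Report read errors on stderr; structured output stays on stdout.
            fmt.Fprintf(os.Stderr, "reading piped input: %v\n", err)
            return err
        }
        return nil
    }
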
Lars Lehtonen
df619ec3f6 fix(weed/filer/redis2): fix dropped error (#8952)
* fix(weed/filer/redis2): fix dropped error

* fix(weed/filer/redis2): break on non-ErrNotFound errors in ListDirectoryEntries

Without the break, a hard FindEntry error gets overwritten by subsequent
iterations and the function may return nil, silently losing the error.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-07 14:59:01 -07:00
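
A sketch of the fixed listing loop (hypothetical types; Entry and ErrNotFound
stand in for the filer's own): a hard lookup error stops the loop and is
returned, instead of being overwritten by later iterations; imports: "context",
"errors", "path":

    func listDirectoryEntries(
        ctx context.Context,
        dirPath string,
        names []string,
        findEntry func(ctx context.Context, fullPath string) (*Entry, error),
    ) ([]*Entry, error) {
        var entries []*Entry
        for _, name := range names {
            entry, err := findEntry(ctx, path.Join(dirPath, name))
            if err != nil {
                if errors.Is(err, ErrNotFound) {
                    continue // entry vanished between listing and lookup; skip it
                }
                return entries, err // break on hard errors instead of dropping them
            }
            entries = append(entries, entry)
        }
        return entries, nil
    }
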
Chris Lu
fb0573ffc4 shell: rename -force to -apply in s3.iam.import for consistency 2026-04-07 14:17:07 -07:00
Chris Lu
b0e79ad207 fix(admin): respect urlPrefix for root redirect and JS API calls (#8975)
* fix(admin): respect urlPrefix for root redirect and JS API calls (#8967)

Two issues when running admin UI behind a reverse proxy with -urlPrefix:

1. Visiting the prefix path without trailing slash (e.g. /s3-admin) caused
   a redirect to / instead of /s3-admin/ because http.StripPrefix produced
   an empty path that the router redirected to root.

2. Several JavaScript API calls in admin.js used hardcoded paths instead
   of basePath(), causing file upload, download, and preview to fail.

* fix(admin): preserve query params in prefix redirect and use 302

Use http.StatusFound instead of 301 to avoid aggressive browser caching
of a configuration-dependent redirect, and preserve query parameters.
2026-04-07 14:12:05 -07:00
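
A sketch of the prefix redirect (generic net/http handler, not the admin router
verbatim): the bare prefix is redirected to prefix+"/" with 302 so browsers do
not cache a configuration-dependent redirect, and the query string is carried
along; imports: "net/http":

    func redirectBarePrefix(prefix string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // After http.StripPrefix, a request to the bare prefix arrives with
            // an empty path.
            if r.URL.Path == "" || r.URL.Path == prefix {
                target := prefix + "/"
                if r.URL.RawQuery != "" {
                    target += "?" + r.URL.RawQuery
                }
                http.Redirect(w, r, target, http.StatusFound) // 302, not 301
                return
            }
            next.ServeHTTP(w, r)
        })
    }
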
Chris Lu
2919bb27e5 fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8974)
* fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8965)

When filer.sync runs with -a.security and -b.security flags, only gRPC
connections received per-cluster TLS configuration. HTTP clients for
volume server reads and uploads used a global singleton with the default
security.toml, causing TLS verification failures when clusters use
different self-signed certificates.

Load per-cluster HTTPS client config from the security files and pass
dedicated HTTP clients to FilerSource (for downloads) and FilerSink
(for uploads) so each direction uses the correct cluster's certificates.

* fix(sync): address review feedback for per-cluster HTTP TLS

- Add insecure_skip_verify support to NewHttpClientWithTLS and read it
  from per-cluster security config via https.client.insecure_skip_verify
- Error on partial mTLS config (cert without key or vice versa)
- Add nil-check for client parameter in DownloadFileWithClient
- Document SetUploader as init-only (same pattern as SetChunkConcurrency)
2026-04-07 14:11:44 -07:00
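
A sketch of a per-cluster HTTPS client along the lines described above (the
config parameters here are assumptions, not the actual security.toml keys);
imports: "crypto/tls", "crypto/x509", "fmt", "net/http", "os":

    func newClusterHTTPClient(caFile, certFile, keyFile string, insecureSkipVerify bool) (*http.Client, error) {
        tlsCfg := &tls.Config{InsecureSkipVerify: insecureSkipVerify}
        if caFile != "" {
            pem, err := os.ReadFile(caFile)
            if err != nil {
                return nil, err
            }
            pool := x509.NewCertPool()
            if !pool.AppendCertsFromPEM(pem) {
                return nil, fmt.Errorf("no certificates found in %s", caFile)
            }
            tlsCfg.RootCAs = pool
        }
        // Error on partial mTLS config instead of silently skipping client auth.
        if (certFile == "") != (keyFile == "") {
            return nil, fmt.Errorf("client cert and key must both be set or both be empty")
        }
        if certFile != "" {
            cert, err := tls.LoadX509KeyPair(certFile, keyFile)
            if err != nil {
                return nil, err
            }
            tlsCfg.Certificates = []tls.Certificate{cert}
        }
        return &http.Client{Transport: &http.Transport{TLSClientConfig: tlsCfg}}, nil
    }

Each cluster (source and sink) would get its own client built from its own
security files, so downloads and uploads each verify against the right CA.
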
Chris Lu
d50889002b shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands (#8956)
* shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands

Add import/export, configuration summary, and a convenience provisioning
command:

- s3.iam.export: dump full IAM state as JSON (stdout or file)
- s3.iam.import: replace IAM state from a JSON file
- s3.config.show: human-readable summary (users, policies, service
  accounts, groups with status and counts)
- s3.user.provision: one-step user+policy+credentials creation for
  common readonly/readwrite/admin roles

Hide legacy commands from help listing:
- s3.configure: still works but hidden from help output
- s3.bucket.access: still works but hidden from help output

Both hidden commands remain fully functional for existing scripts.

Also adds a Hidden command tag and filters it from printGenericHelp.

* shell: address review feedback for s3.iam.*, s3.config.show, s3.user.provision

- Simplify joinMax using strings.Join
- Fix rolePolicies: remove s3:ListBucket from object-level actions
  (already covered by bucket-level statement)
- Fix admin role: grant s3:* on bucket resource too
- Return flag parse errors instead of swallowing them

* shell: address missed review feedback for PR 3

- s3.iam.import: require -force flag for destructive IAM overwrite
- s3.config.show: add nil guard for resp.Configuration
- s3.user.provision: check if user exists before creating policy
- s3.user.provision: reject wildcard bucket names (* ?)

* shell: distinguish NotFound from transient errors in provision, use %w wrapping

- s3.user.provision: check gRPC status code on GetUser error — only
  proceed on NotFound, abort on transient/network errors
- s3.iam.import: use %w for error wrapping to preserve error chains,
  wrap PutConfiguration error with context

* shell: remove duplicate joinMax after PR 8954 merge

command_s3_helpers.go defined joinMax which is already in
command_s3_user_list.go from the merged PR 8954.

* shell: restrict export file permissions, rollback policy on user create failure

- s3.iam.export: use os.OpenFile with mode 0600 instead of os.Create
  to protect exported credentials from other users
- s3.user.provision: rollback the created policy if CreateUser fails,
  with a warning if the rollback itself fails
2026-04-07 14:10:15 -07:00
Chris Lu
efc7f3936f fix(s3): handle empty URL path in forwarded prefix signature verification (#8973)
fix(s3): handle empty URL path in forwarded prefix signature verification (#8966)

When S3 is behind a reverse proxy with a forwarded prefix (e.g. /s3),
requests with an empty URL path (like ListBuckets) would incorrectly
get a trailing slash appended (e.g. /s3/), causing signature
verification to fail because the client signs /s3 without the slash.
2026-04-07 13:22:21 -07:00
Chris Lu
79a48256f5 fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8971)
* fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8969)

ListObjectsV2/V1 requests with prefix-restricted STS session policies
were denied because:
1. s3:prefix was derived from objectKey, which the auth middleware set to
   the prefix value, but the resource ARN then included the prefix
   (e.g. arn:aws:s3:::bucket/prefix) instead of staying at bucket level
   (arn:aws:s3:::bucket) as AWS requires for ListBucket.
2. When objectKey was empty (no middleware propagation), s3:prefix was
   never populated from the query parameter at all.

Now AuthorizeAction extracts the prefix query parameter directly, sets it
as s3:prefix in the request context, and uses a bucket-level resource ARN
when the objectKey matches the propagated prefix.

* fix(s3): use AWS-style wildcard matching for StringLike policy conditions

filepath.Match treats * as not matching /, which breaks IAM StringLike
conditions on paths (e.g. arn:aws:s3:::bucket/* won't match nested keys).
Replace with a case-sensitive variant of AwsWildcardMatch that correctly
treats * as matching any character including /.

* refactor(s3): replace regex wildcard matching with string-based matcher

Use the existing wildcard.MatchesWildcard utility instead of compiling
and caching regexes for IAM wildcard matching. Removes the regexCache,
its mutex, and the sync import.

* refactor(s3): inline and remove AwsWildcardMatch wrapper functions

Replace all call sites with direct wildcard.MatchesWildcard calls.

* fix(s3): scope s3:prefix condition key to list operations only

The s3:prefix logic was running for all actions, so a GetObject on
"foo/bar" would wrongly populate s3:prefix. Restrict it to action "List"
and always reset resourceObjectKey to "" so the resource ARN stays at
bucket level. Also set s3:prefix to "" when no prefix is provided, so
policies with StringEquals {"s3:prefix": ""} evaluate correctly.
2026-04-07 13:21:30 -07:00
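
For reference, a self-contained AWS-style wildcard matcher of the kind the
commit describes (a generic dynamic-programming implementation, not the
wildcard package's code): '*' matches any run of characters including '/', and
'?' matches exactly one character, unlike filepath.Match where '*' stops at
path separators.

    func matchesWildcard(pattern, s string) bool {
        // dp[i][j] reports whether pattern[:i] matches s[:j].
        p, n := len(pattern), len(s)
        dp := make([][]bool, p+1)
        for i := range dp {
            dp[i] = make([]bool, n+1)
        }
        dp[0][0] = true
        for i := 1; i <= p; i++ {
            if pattern[i-1] == '*' {
                dp[i][0] = dp[i-1][0] // '*' may match the empty string
            }
            for j := 1; j <= n; j++ {
                switch pattern[i-1] {
                case '*':
                    dp[i][j] = dp[i-1][j] || dp[i][j-1] // match nothing, or consume s[j-1]
                case '?':
                    dp[i][j] = dp[i-1][j-1]
                default:
                    dp[i][j] = dp[i-1][j-1] && pattern[i-1] == s[j-1]
                }
            }
        }
        return dp[p][n]
    }

With these semantics, matchesWildcard("arn:aws:s3:::bucket/*", "arn:aws:s3:::bucket/a/b")
is true, which is what IAM StringLike conditions on nested keys require.
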
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
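
A sketch of the per-item aging queue and the eager parent walk (hypothetical
shape, not the S3 server code; the real delay defaults to 2 minutes and is read
from filer.toml as shown above); imports: "path", "time":

    type pendingFolder struct {
        dir      string
        enqueued time.Time
    }

    // processAged deletes only the folders whose individual age exceeds the
    // delay, and after each successful delete walks upward so an emptied parent
    // does not wait another full delay per nesting level.
    func processAged(queue []pendingFolder, delay time.Duration,
        deleteIfEmpty func(dir string) bool) []pendingFolder {
        now := time.Now()
        var remaining []pendingFolder
        for _, item := range queue {
            if now.Sub(item.enqueued) < delay {
                remaining = append(remaining, item) // not aged yet; keep queued
                continue
            }
            for dir := item.dir; dir != "/" && deleteIfEmpty(dir); dir = path.Dir(dir) {
                // deleteIfEmpty returns false once a non-empty ancestor is reached
            }
        }
        return remaining
    }
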
Chris Lu
761ec7da00 fix(iceberg): use dot separator for namespace paths instead of unit separator (#8960)
* fix(iceberg): use dot separator for namespace paths instead of unit separator

The Iceberg REST Catalog handler was using \x1F (unit separator) to join
multi-level namespaces when constructing S3 location and filer paths. The
S3 Tables storage layer uses "." (dot) as the namespace separator, causing
tables created via the Iceberg REST API to point to different paths than
where S3 Tables actually stores them.

Fixes #8959

* fix(iceberg): use dot separator in log messages for readable namespace output

* fix(iceberg): use path.Join for S3 location path segments

Use path.Join to construct the namespace/table path segments in fallback
S3 locations for robustness and consistency with handleCreateTable.

* test(iceberg): add multi-level namespace integration tests for Spark and Trino

Add regression tests for #8959 that create a two-level namespace (e.g.
"analytics.daily"), create a table under it, insert data, and query it
back. This exercises the dot-separated namespace path construction and
verifies that Spark/Trino can actually read the data at the S3 location
returned by the Iceberg REST API.

* fix(test): enable nested namespace in Trino Iceberg catalog config

Trino requires `iceberg.rest-catalog.nested-namespace-enabled=true` to
support multi-level namespaces. Without this, CREATE SCHEMA with a
dotted name fails with "Nested namespace is not enabled for this catalog".

* fix(test): parse Trino COUNT(*) output as integer instead of substring match

Avoids false matches from strings.Contains(output, "3") by parsing the
actual numeric result with strconv.Atoi and asserting equality.

* fix(test): use separate Trino config for nested namespace test

The nested-namespace-enabled=true setting in Trino changes how SHOW
SCHEMAS works, causing "Internal error" for all tests sharing that
catalog config. Move the flag to a dedicated config used only by
TestTrinoMultiLevelNamespace.

* fix(iceberg): support parent query parameter in ListNamespaces for nested namespaces

Add handling for the Iceberg REST spec's `parent` query parameter in
handleListNamespaces. When Trino has nested-namespace-enabled=true, it
sends `GET /v1/namespaces?parent=<ns>` to list child namespaces. The
parent value is decoded from the Iceberg unit separator format and
converted to a dot-separated prefix for the S3 Tables layer.

Also simplify TestTrinoMultiLevelNamespace to focus on namespace
operations (create, list, show tables) rather than data operations,
since Trino's REST catalog has a non-empty location check that conflicts
with server-side metadata creation.

* fix(test): expand Trino multi-level namespace test and merge config helpers

- Expand TestTrinoMultiLevelNamespace to create a table with explicit
  location, insert rows, query them back, and verify the S3 file path
  contains the dot-separated namespace (not \x1F). This ensures the
  original #8959 bug would be caught by the Trino integration test.
- Merge writeTrinoConfig and writeTrinoNestedNamespaceConfig into a
  single parameterized function using functional options.
2026-04-07 12:21:22 -07:00
Chris Lu
d4548376a1 fix(ec): off-by-one in nLargeBlockRows causes EC read corruption (#8957)
* fix(ec): off-by-one in nLargeBlockRows causes EC read corruption (#8947)

The nLargeBlockRows formula in locateOffset used (shardDatSize-1)/largeBlockLength,
which produces an off-by-one error when shardDatSize is an exact multiple of
largeBlockLength (e.g. a 30GB volume with 10 data shards = 3GB per shard).
This causes needles in the last large block row to be mislocated as small blocks,
reading from completely wrong shard positions and returning garbage data.

Fix: remove the -1 from locateOffset and only apply it in the ecdFileSize fallback
path (old volumes without datFileSize in .vif), where it's needed to handle the
ambiguous case conservatively.

Also fix ReadEcShardNeedle to pass offset=0 to ReadBytes, consistent with the
scrub path, since the bytes buffer already starts at position 0.

* fix: add volume context to EC read errors, remove contextless glog

The glog.Errorf in ReadBytes logged "entry not found" without any volume
ID, making it impossible to identify which volume was affected. Remove
this contextless log and instead add volume ID, needle ID, offset, and
size to the error returned from the EC read path.

The EC scrub callers already wrap errors with volume context.
2026-04-07 12:02:51 -07:00
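
The core of the fix in sketch form (simplified; the real locateOffset also
deals with small blocks and shard indices): with, say, 1GB large blocks and a
3GB shard, the old formula (shardDatSize-1)/largeBlockLength gives 2 rows
instead of 3, so needles in the last large row were mislocated.

    // Row count over the authoritative datFileSize from .vif: no -1, so an
    // exact multiple of largeBlockLength counts all of its rows.
    func nLargeBlockRows(shardDatSize, largeBlockLength int64) int64 {
        return shardDatSize / largeBlockLength
    }

    // Fallback for old volumes without datFileSize in .vif, where the size is
    // inferred from the .ecd file: keep the -1 to resolve the ambiguous
    // exact-multiple case conservatively.
    func nLargeBlockRowsFromEcd(ecdFileSize, largeBlockLength int64) int64 {
        return (ecdFileSize - 1) / largeBlockLength
    }
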
Chris Lu
45bf3ad058 shell: add s3.user.* and s3.policy.attach|detach commands (#8954)
* shell: add s3.user.* and s3.policy.attach|detach commands

Add focused IAM shell commands following a noun-verb model:

- s3.user.create: create user with auto-generated or explicit credentials
- s3.user.list: tabular listing with status, policies, key count
- s3.user.show: detailed user view (status, source, policies, credentials)
- s3.user.delete: delete a user
- s3.user.enable: enable a disabled user
- s3.user.disable: disable a user (preserves credentials and policies)
- s3.policy.attach: attach a named policy to a user
- s3.policy.detach: detach a policy from a user

These commands are thin wrappers over the existing IAM gRPC service,
producing human-readable output instead of raw protobuf text.

This is part of a larger effort to replace the monolithic s3.configure
command with a composable set of single-purpose commands.

* shell: address review feedback for s3.user.* and s3.policy.attach|detach

- Return flag parse errors instead of swallowing them (all commands)
- Use GetConfiguration instead of N+1 GetUser calls in s3.user.list
- Add nil check for resp.Identity in s3.user.show
- Fix GetPolicy error masking in s3.policy.attach (wrap original error)
- Simplify joinMax using strings.Join

* shell: add nil identity guards and wrap gRPC errors

- Add nil check for resp.Identity in policy_attach, policy_detach,
  user_enable, user_disable
- Wrap GetUser errors with user context for better diagnostics
2026-04-07 11:26:57 -07:00
Chris Lu
d123a2768b shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands (#8955)
* shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands

Add credential, anonymous access, and service account management commands:

Access key commands:
- s3.accesskey.create: add credentials to an existing user
- s3.accesskey.list: list access keys for a user (key ID + status)
- s3.accesskey.delete: remove a specific access key
- s3.accesskey.rotate: atomic create-new + delete-old key rotation

Anonymous access commands:
- s3.anonymous.set: set/remove public access on a bucket
- s3.anonymous.get: show anonymous access for a bucket
- s3.anonymous.list: list all buckets with anonymous access

Service account commands:
- s3.serviceaccount.create: create with optional action subset and expiry
- s3.serviceaccount.list: tabular listing, optionally filtered by parent
- s3.serviceaccount.show: detailed view of a service account
- s3.serviceaccount.delete: remove a service account

These replace the credential and anonymous portions of the monolithic
s3.configure and s3.bucket.access commands.

* shell: address review feedback for s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.*

- Return flag parse errors instead of swallowing them (all commands)
- Add action validation in s3.anonymous.set (Read, Write, List, Tagging, Admin)
- Fix s3.serviceaccount.create output: point users to s3.serviceaccount.list
  for the server-assigned ID, since CreateServiceAccountResponse does not
  return it

* shell: fix bucket matching and action validation in s3.anonymous.*

- Use SplitN instead of HasSuffix for bucket name matching to avoid
  false positives when one bucket name is a suffix of another
- Make action validation case-insensitive with canonical normalization

* shell: fix nil panics, dedup actions, validate service account actions

- Fix nil-pointer panic in getOrCreateAnonymousUser when GetUser returns
  err==nil with nil Identity (status.FromError(nil) returns nil status)
- Add nil Identity guards in s3.anonymous.get and s3.anonymous.list
- Deduplicate action values in s3.anonymous.set (e.g. -access Read,Read)
- Add action validation in s3.serviceaccount.create with case normalization

* shell: dedup actions and reject negative expiry in s3.serviceaccount.create

- Deduplicate -actions values (e.g. Read,read,Read produces one entry)
- Reject negative -expiry values instead of silently treating as no expiration
2026-04-07 11:20:15 -07:00
Chris Lu
733517df30 fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads (#8968)
* fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads

The PolicyEngine.evaluateStatement() method used raw regex matching for
actions, bypassing the multipart-inherits-PutObject logic that only
existed in the unused CompiledStatement.MatchesAction() code path.

When a bucket policy granted only s3:PutObject, multipart upload
operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload,
etc.) were denied, forcing users to explicitly list every multipart
action.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8751

* fix(s3): add s3:UploadPartCopy to multipartActionSet and improve test coverage

Add missing S3_ACTION_UPLOAD_PART_COPY constant and include it in
multipartActionSet so UploadPartCopy is implicitly allowed by s3:PutObject.

Also add a bucket-ARN sub-test for ListBucketMultipartUploads to verify
that an object-only resource pattern does not match bucket-level requests.
2026-04-07 11:13:29 -07:00
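
A sketch of the expansion as described (the exact membership of the action set
is illustrative, and statementMatches stands in for the engine's own matcher):
multipart operations fall back to the statement's s3:PutObject grant, and the
lookup is case-insensitive because IAM action names are; imports: "strings":

    var multipartActionSet = map[string]bool{
        "s3:createmultipartupload":   true,
        "s3:uploadpart":              true,
        "s3:uploadpartcopy":          true,
        "s3:completemultipartupload": true,
        "s3:abortmultipartupload":    true,
    }

    func actionAllowed(statementMatches func(action string) bool, requested string) bool {
        if statementMatches(requested) {
            return true
        }
        // Only consult the PutObject grant when the request is a multipart action.
        if multipartActionSet[strings.ToLower(requested)] {
            return statementMatches("s3:PutObject")
        }
        return false
    }
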
Chris Lu
0fed72d95a volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas

When -toReplication is specified, volume.tier.move now creates all
required replicas on the destination tier before deleting old replicas.
This closes the data-loss window where only one copy existed on the
target tier while awaiting volume.fix.replication.

If replication fulfillment fails, old replicas are preserved and marked
writable so the volume remains accessible.

Also extracts replicateVolumeToServer and configureVolumeReplication
helpers to reduce duplication across volume.tier.move and
volume.fix.replication.

Fixes #8937

* volume.tier.move: always fulfill replication before deleting old replicas

When -toReplication is specified, use that replication setting.
Otherwise, read the volume's existing replication from the super block.
In both cases, all required replicas are created on the destination
tier before old replicas are deleted.

If replication fulfillment fails (e.g. not enough destination nodes),
old replicas are preserved and marked writable so no data is lost.

* volume.tier.move: address review feedback on ensureReplicationFulfilled

- Add 5s delay before re-collecting topology to allow master heartbeat
  propagation after the move
- Add nil guard for targetTierReplicas to prevent panic if the moved
  replica is not yet visible in the topology
- Treat configureVolumeReplication failure as a hard error instead of a
  warning, so the rollback logic preserves old replicas

* volume.tier.move: harden replication config error handling

- Make configureVolumeReplication failure on the primary moved replica a
  hard error that aborts the move, instead of logging and continuing
- Configure replication metadata on all existing target-tier replicas
  (not just newly created ones) when -toReplication is specified
- Deletion of old replicas cannot affect new replicas since the
  locations list only contains pre-move servers (verified, no change)

* volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery

Fix 1: The cleanup loop now preserves pre-existing target-tier replicas
that ensureReplicationFulfilled counted toward the replication target.
Previously, a mixed-tier volume with an existing replica on the target
tier could have that replica deleted right after being counted as
fulfilled, leaving the volume under-replicated.

ensureReplicationFulfilled now returns a preserveServers set that the
deletion loop checks before removing any old replica.

Fix 2: Failure paths after LiveMoveVolume (which deletes the source
replica) now use restoreSurvivingReplicasWritable instead of
markVolumeReplicasWritable. The old helper stopped on first error, so
attempting to mark the already-deleted source writable would prevent
all surviving replicas from being restored. The new helper skips the
deleted source and continues through all remaining locations, logging
per-replica errors instead of aborting.

* volume.tier.move: mark preserved replicas writable, skip nodes with existing volume

Fix 1: Preserved pre-existing target-tier replicas were left read-only
after the move completed. They were marked read-only at the start
(along with all other replicas) but never restored since the old code
deleted them. Now they are explicitly marked writable before cleanup.

Fix 2: The fulfillment loop could pick a candidate node that already
hosts this volume on a different disk type, causing a VolumeCopy
conflict. Added a guard that skips any node already hosting the volume
(on any disk) before attempting replication.
2026-04-06 14:55:37 -07:00
dependabot[bot]
d0692f14ad build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.19.13 to 1.19.14 (#8942)
build(deps): bump github.com/aws/aws-sdk-go-v2/credentials

Bumps [github.com/aws/aws-sdk-go-v2/credentials](https://github.com/aws/aws-sdk-go-v2) from 1.19.13 to 1.19.14.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/credentials/v1.19.13...credentials/v1.19.14)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/credentials
  dependency-version: 1.19.14
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 13:26:02 -07:00
Chris Lu
69218c88fe fix(stats): fix build on openbsd, solaris, and windows (#8951)
fix(stats): replace undefined calculateDiskRemaining with inline calculation

disk_openbsd.go, disk_solaris.go, and disk_windows.go all call
calculateDiskRemaining() which is never defined, causing build failures
on those platforms. Replace with the same inline calculation used in
disk_supported.go.
2026-04-06 12:48:48 -07:00
Chris Lu
b0a4647d87 fix: prevent stack overflow in ECBalanceTask.reportProgress (#8949)
* fix: prevent stack overflow in ECBalanceTask.reportProgress

Add re-entry guard to reportProgress() to prevent infinite recursion.
The progressCallback invoked by ReportProgressWithStage can re-enter
reportProgress, causing a stack overflow that crashes the worker process
(goroutine stack exceeds 1GB limit after ~22M frames).

* fix: use atomics for progress and re-entry guard to avoid data races

Address review feedback: GetProgress() can be called from a different
goroutine while reportProgress is updating the value. Use atomic
operations for both the progress field (via Float64bits/Float64frombits)
and the reporting re-entry guard (via CompareAndSwap).
2026-04-06 12:26:38 -07:00
dependabot[bot]
83a632669a build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.96.0 to 1.98.0 (#8943)
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.96.0 to 1.98.0.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.96.0...service/s3/v1.98.0)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/s3
  dependency-version: 1.98.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:51:10 -07:00
dependabot[bot]
331d76e024 build(deps): bump google.golang.org/api from 0.267.0 to 0.274.0 (#8945)
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.267.0 to 0.274.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-api-go-client/compare/v0.267.0...v0.274.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-version: 0.274.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:49 -07:00
dependabot[bot]
2b73db9c71 build(deps): bump go.etcd.io/etcd/client/v3 from 3.6.9 to 3.6.10 (#8944)
Bumps [go.etcd.io/etcd/client/v3](https://github.com/etcd-io/etcd) from 3.6.9 to 3.6.10.
- [Release notes](https://github.com/etcd-io/etcd/releases)
- [Commits](https://github.com/etcd-io/etcd/compare/v3.6.9...v3.6.10)

---
updated-dependencies:
- dependency-name: go.etcd.io/etcd/client/v3
  dependency-version: 3.6.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:41 -07:00
dependabot[bot]
9a7c731e68 build(deps): bump github.com/hashicorp/vault/api from 1.22.0 to 1.23.0 (#8941)
Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.22.0 to 1.23.0.
- [Release notes](https://github.com/hashicorp/vault/releases)
- [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG-v1.10-v1.15.md)
- [Commits](https://github.com/hashicorp/vault/compare/api/v1.22.0...api/v1.23.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/vault/api
  dependency-version: 1.23.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:21 -07:00
dependabot[bot]
5c9d3949be build(deps): bump actions/upload-artifact from 4 to 7 (#8940)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:13 -07:00
dependabot[bot]
7dd6d5547e build(deps): bump docker/login-action from 4.0.0 to 4.1.0 (#8939)
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/v4...v4.1.0)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 4.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:06 -07:00
dependabot[bot]
b201386c8c build(deps): bump actions/download-artifact from 4 to 8 (#8938)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:00 -07:00
Mmx233
3cea900241 fix: replication sinks upload ciphertext for SSE-encrypted objects (#8931)
* fix: decrypt SSE-encrypted objects in S3 replication sink

* fix: add SSE decryption support to GCS, Azure, B2, Local sinks

* fix: return error instead of warning for SSE-C objects during replication

* fix: close readers after upload to prevent resource leaks

* fix: return error for unknown SSE types instead of passing through ciphertext

* refactor(repl_util): extract CloseReader/CloseMaybeDecryptedReader helpers

The io.Closer close-on-error and defer-close pattern was duplicated in
copyWithDecryption and the S3 sink. Extract exported helpers to keep a
single implementation and prevent future divergence.

* fix(repl_util): warn on mixed SSE types across chunks in detectSSEType

detectSSEType previously returned the SSE type of the first encrypted
chunk without inspecting the rest. If an entry somehow has chunks with
different SSE types, only the first type's decryption would be applied.
Now scans all chunks and logs a warning on mismatch.

* fix(repl_util): decrypt inline SSE objects during replication

Small SSE-encrypted objects stored in entry.Content were being copied
as ciphertext because:
1. detectSSEType only checked chunk metadata, but inline objects have
   no chunks — now falls back to checking entry.Extended for SSE keys
2. Non-S3 sinks short-circuited on len(entry.Content)>0, bypassing
   the decryption path — now call MaybeDecryptContent before writing

Adds MaybeDecryptContent helper for decrypting inline byte content.

* fix(repl_util): add KMS initialization for replication SSE decryption

SSE-KMS decryption was not wired up for filer.backup — the only
initialization was for SSE-S3 key manager. CreateSSEKMSDecryptedReader
requires a global KMS provider which is only loaded by the S3 API
auth-config path.

Add InitializeSSEForReplication helper that initializes both SSE-S3
(from filer KEK) and SSE-KMS (from Viper config [kms] section /
WEED_KMS_* env vars). Replace the SSE-S3-only init in filer_backup.go.

* fix(replicator): initialize SSE decryption for filer.replicate

The SSE decryption setup was only added to filer_backup.go, but the
notification-based replicator (filer.replicate) uses the same sinks
and was missing the required initialization. Add SSE init in
NewReplicator so filer.replicate can decrypt SSE objects.

* refactor(repl_util): fold entry param into CopyFromChunkViews

Remove the CopyFromChunkViewsWithEntry wrapper and add the entry
parameter directly to CopyFromChunkViews, since all callers already
pass it.

* fix(repl_util): guard SSE init with sync.Once, error on mixed SSE types

InitializeWithFiler overwrites the global superKey on every call.
Wrap InitializeSSEForReplication with sync.Once so repeated calls
(e.g. from NewReplicator) are safe.

detectSSEType now returns an error instead of logging a warning when
chunks have inconsistent SSE types, so replication aborts rather than
silently applying the wrong decryption to some chunks.

* fix(repl_util): allow SSE init retry, detect conflicting metadata, add tests

- Replace sync.Once with mutex+bool so transient failures (e.g. filer
  unreachable) don't permanently prevent initialization. Only successful
  init flips the flag; failed attempts allow retries.

- Remove v.IsSet("kms") guard that prevented env-only KMS configs
  (WEED_KMS_*) from being detected. Always attempt KMS loading and let
  LoadConfigurations handle "no config found".

- detectSSEType now checks for conflicting extended metadata keys
  (e.g. both SeaweedFSSSES3Key and SeaweedFSSSEKMSKey present) and
  returns an error instead of silently picking the first match.

- Add table-driven tests for detectSSEType, MaybeDecryptReader, and
  MaybeDecryptContent covering plaintext, uniform SSE, mixed chunks,
  inline SSE via extended metadata, conflicting metadata, and SSE-C.

* test(repl_util): add SSE-S3 and SSE-KMS integration tests

Add round-trip encryption/decryption tests:
- SSE-S3: encrypt with CreateSSES3EncryptedReader, decrypt with
  CreateSSES3DecryptedReader, verify plaintext matches
- SSE-KMS: encrypt with AES-CTR, wire a mock KMSProvider via
  SetGlobalKMSProvider, build serialized KMS metadata, verify
  MaybeDecryptReader and MaybeDecryptContent produce correct plaintext

Fix existing tests to check io.ReadAll errors.

* test(repl_util): exercise full SSE-S3 path through MaybeDecryptReader

Replace direct CreateSSES3DecryptedReader calls with end-to-end tests
that go through MaybeDecryptReader → decryptSSES3 →
DeserializeSSES3Metadata → GetSSES3IV → CreateSSES3DecryptedReader.

Uses WEED_S3_SSE_KEK env var + a mock filer client to initialize the
global key manager with a test KEK, then SerializeSSES3Metadata to
build proper envelope-encrypted metadata. Cleanup restores the key
manager state.

* fix(localsink): write to temp file to prevent truncated replicas

The local sink truncated the destination file before writing content.
If decryption or chunk copy failed, the file was left empty/truncated,
destroying the previous replica.

Write to a temp file in the same directory and atomically rename on
success. On any error the temp file is cleaned up and the existing
replica is untouched.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-06 00:32:27 -07:00
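
A sketch of the retryable init guard described above (hypothetical names): a
sync.Once would latch even when initialization fails, so a mutex plus a success
flag is used instead and only a successful attempt flips the flag; imports:
"sync":

    var (
        sseInitMu   sync.Mutex
        sseInitDone bool
    )

    func initializeSSEForReplicationOnce(initialize func() error) error {
        sseInitMu.Lock()
        defer sseInitMu.Unlock()
        if sseInitDone {
            return nil // a previous call already succeeded
        }
        if err := initialize(); err != nil {
            return err // leave the flag unset so the next caller retries
        }
        sseInitDone = true
        return nil
    }
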
Chris Lu
7ab6306e15 fix(kafka): resolve consumer group resumption timeout in e2e tests (#8935)
* fix(kafka): resolve consumer group resumption timeout in e2e tests

Three issues caused ConsumerGroupResumption to time out when the second
consumer tried to resume from committed offsets:

1. ForceCompleteRebalance deadlock: performCleanup() held group.Mu.Lock
   then called ForceCompleteRebalance() which tried to acquire the same
   lock — a guaranteed deadlock on Go's non-reentrant sync.Mutex. Fixed
   by requiring callers to hold the lock (matching actual call sites).

2. Unbounded fallback fetch: when the multi-batch fetch timed out, the
   fallback GetStoredRecords call used the connection context (no
   deadline). A slow broker gRPC call could block the data-plane
   goroutine indefinitely, causing head-of-line blocking for all
   responses on that connection. Fixed with a 10-second timeout.

3. HWM lookup failure caused empty responses: after a consumer leaves
   and the partition is deactivated, GetLatestOffset can fail. The
   fetch handler treated this as "no data" and entered the long-poll
   loop (up to 10s × 4 retries = 40s timeout). Fixed by assuming data
   may exist when HWM lookup fails, so the actual fetch determines
   availability.

* fix(kafka): address review feedback on HWM sentinel and fallback timeout

- Don't expose synthetic HWM (requestedOffset+1) to clients; keep
  result.highWaterMark at 0 when the real HWM lookup fails.
- Tie fallback timeout to client's MaxWaitTime instead of a fixed 10s,
  so one slow partition doesn't hold the reader beyond the request budget.

* fix(kafka): use large HWM sentinel and clamp fallback timeout

- Use requestedOffset+10000 as sentinel HWM instead of +1, so
  FetchMultipleBatches doesn't artificially limit to 1 record.
- Add 2s floor to fallback timeout so disk reads via gRPC have
  a reasonable chance even when maxWaitMs is small or zero.

* fix(kafka): use MaxInt64 sentinel and derive HWM from fetch result

- Use math.MaxInt64 as HWM sentinel to avoid integer overflow risk
  (previously requestedOffset+10000 could wrap on large offsets).
- After the fetch, derive a meaningful HWM from newOffset so the
  client never sees MaxInt64 or 0 in the response.

* fix(kafka): use remaining time budget for fallback fetch

The fallback was restarting the full maxWaitMs budget even though the
multi-batch fetch already consumed part of it. Now compute remaining
time from either the parent context deadline or maxWaitMs minus
elapsed, skip the fallback if budget is exhausted, and clamp to
[2s, 10s] bounds.
2026-04-05 20:13:57 -07:00
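
A sketch of the remaining-budget calculation from the last item (hypothetical
shape; the 2s floor and 10s ceiling are the bounds named in the commit);
imports: "context", "time":

    // fallbackTimeout returns how long the fallback GetStoredRecords call may
    // take, or ok=false when the request budget is already exhausted.
    func fallbackTimeout(ctx context.Context, maxWait, elapsed time.Duration) (time.Duration, bool) {
        remaining := maxWait - elapsed
        if deadline, hasDeadline := ctx.Deadline(); hasDeadline {
            if until := time.Until(deadline); until < remaining {
                remaining = until
            }
        }
        if remaining <= 0 {
            return 0, false // skip the fallback entirely
        }
        const floor, ceiling = 2 * time.Second, 10 * time.Second
        if remaining < floor {
            remaining = floor // give disk reads over gRPC a reasonable chance
        }
        if remaining > ceiling {
            remaining = ceiling // never hold the data-plane goroutine too long
        }
        return remaining, true
    }
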
Chris Lu
72eb93919c fix(gcssink): prevent empty object finalization on write failure (#8933)
* fix(gcssink): prevent empty object finalization on write failure

The GCS writer was created unconditionally with defer wc.Close(),
which finalizes the upload even when content decryption or copy
fails. This silently overwrites valid objects with empty data.
Remove the unconditional defer, explicitly close on success to
propagate errors, and delete the object on write failure.

* fix(gcssink): use context cancellation instead of obj.Delete on failure

obj.Delete() after a failed write would delete the existing object at
that key, causing data loss on updates. Use a cancelable context
instead — cancelling before Close() aborts the GCS upload without
touching any pre-existing object.
2026-04-05 16:07:49 -07:00
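
A sketch of the abort-on-failure pattern (the cloud.google.com/go/storage
writer API is real; the surrounding sink code is paraphrased, and write stands
in for the decrypt-and-copy step): cancelling the writer's context before Close
abandons the upload, so nothing is finalized and no pre-existing object is
touched; imports: "context", "io", "cloud.google.com/go/storage":

    func writeObject(ctx context.Context, obj *storage.ObjectHandle, write func(io.Writer) error) error {
        writeCtx, cancel := context.WithCancel(ctx)
        defer cancel()
        wc := obj.NewWriter(writeCtx)
        if err := write(wc); err != nil {
            cancel()        // abort the upload; the object at this key is untouched
            _ = wc.Close()  // error intentionally ignored; the write error is what matters
            return err
        }
        return wc.Close() // only a successful Close finalizes the new object
    }
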
Chris Lu
4fd974b16b fix(azuresink): delete freshly created blob on write failure (#8934)
* fix(azuresink): delete freshly created blob on write failure

appendBlobClient.Create() runs before content decryption and copy.
If MaybeDecryptContent or CopyFromChunkViews fails, an empty blob
is left behind, silently replacing any previous valid data. Add
cleanup that deletes the blob on content write errors when we were
the ones who created it.

* fix(azuresink): track recreated blobs for cleanup on write failure

handleExistingBlob deletes and recreates the blob when overwrite is
needed, but freshlyCreated was only set on the initial Create success
path. Set freshlyCreated = needsWrite after handleExistingBlob so
recreated blobs are also cleaned up on content write failure.
2026-04-05 16:07:34 -07:00
Chris Lu
b8fc99a9cd fix(s3): apply PutObject multipart expansion to STS session policies (#8932)
* fix(s3): apply PutObject multipart expansion to STS session policy evaluation (#8929)

PR #8445 added logic to implicitly grant multipart upload actions when
s3:PutObject is authorized, but only in the S3 API policy engine's
CompiledStatement.MatchesAction(). STS session policies are evaluated
through the IAM policy engine's matchesActions() -> awsIAMMatch() path,
which did plain pattern matching without the multipart expansion.

Add the same multipart expansion logic to the IAM policy engine's
matchesActions() so that session policies containing s3:PutObject
correctly allow multipart upload operations.

* fix: make multipart action set lookup case-insensitive and optimize

Address PR review feedback:
- Lowercase multipartActionSet keys and use strings.ToLower for lookup,
  since AWS IAM actions are case-insensitive
- Only check for s3:PutObject permission when the requested action is
  actually a multipart action, avoiding unnecessary awsIAMMatch calls
- Add test case for case-insensitive multipart action matching
2026-04-05 14:06:50 -07:00
Mmx233
69cd5fa37b fix: S3 sink puts all entry.Extended into Tagging header instead of only object tags (#8930)
* test: add failing tests for S3 sink buildTaggingString

* fix: S3 sink should only put object tags into Tagging header

* fix: avoid sending empty x-amz-tagging header
2026-04-05 12:16:04 -07:00
Chris Lu
076d504044 fix(admin): reduce memory usage and verbose logging for large clusters (#8927)
* fix(admin): reduce memory usage and verbose logging for large clusters (#8919)

The admin server used excessive memory and produced thousands of log lines
on clusters with many volumes (e.g., 33k volumes). Three root causes:

1. Scanner duplicated all volume metrics: getVolumeHealthMetrics() created
   VolumeHealthMetrics objects, then convertToTaskMetrics() copied them all
   into identical types.VolumeHealthMetrics. Now uses the task-system type
   directly, eliminating the duplicate allocation and removing convertToTaskMetrics.

2. All previous task states loaded at startup: LoadTasksFromPersistence read
   and deserialized every .pb file from disk, logging each one. With thousands
   of balance tasks persisted, this caused massive startup I/O, memory usage,
   and log noise (including unguarded DEBUG glog.Infof per task). Now starts
   with an empty queue — the scanner re-detects current needs from live cluster
   state. Terminal tasks are purged from memory and disk when new scan results
   arrive.

3. Verbose per-volume/per-node logging: V(2) and V(3) logs produced thousands
   of lines per scan. Per-volume logs bumped to V(4), per-node/rack/disk logs
   bumped to V(3). Topology summary now logs counts instead of full node ID arrays.

Also removes lastTopologyInfo field from MaintenanceScanner — the raw protobuf
topology is returned as a local value and not retained between 30-minute scans.

* fix(admin): delete stale task files at startup, add DeleteAllTaskStates

Old task .pb files from previous runs were left on disk. The periodic
CleanupCompletedTasks still loads all files to find completed ones —
the same expensive 4GB path from the pprof profile.

Now at startup, DeleteAllTaskStates removes all .pb files by scanning
the directory without reading or deserializing them. The scanner will
re-detect any tasks still needed from live cluster state.

* fix(admin): don't persist terminal tasks to disk

CompleteTask was saving failed/completed tasks to disk where they'd
accumulate. The periodic cleanup only triggered for completed tasks,
not failed ones. Now terminal tasks are deleted from disk immediately
and only kept in memory for the current session's UI.

* fix(admin): cap in-memory tasks to 100 per job type

Without a limit, the task map grows unbounded — balance could create
thousands of pending tasks for a cluster with many imbalanced volumes.
Now AddTask rejects new tasks when a job type already has 100 in the
queue. The scanner will re-detect skipped volumes on the next scan.

* fix(admin): address PR review - memory-only purge, active-only capacity

- purgeTerminalTasks now only cleans in-memory map (terminal tasks are
  already deleted from disk by CompleteTask)
- Per-type capacity limit counts only active tasks (pending/assigned/
  in_progress), not terminal ones
- When at capacity, purge terminal tasks first before rejecting

* fix(admin): fix orphaned comment, add TaskStatusCancelled to terminal switch

- Move hasQueuedOrActiveTaskForVolume comment to its function definition
- Add TaskStatusCancelled to the terminal state switch in CompleteTask
  so cancelled task files are deleted from disk
2026-04-04 18:45:57 -07:00
Chris Lu
2c8a1ea6cc fix(docker): disable glibc _FORTIFY_SOURCE for aarch64-musl cross builds
When cross-compiling aws-lc-sys for aarch64-unknown-linux-musl using
aarch64-linux-gnu-gcc, glibc's _FORTIFY_SOURCE generates calls to
__memcpy_chk, __fprintf_chk etc. which don't exist in musl, causing
linker errors. Disable it via CFLAGS_aarch64_unknown_linux_musl.
2026-04-04 14:25:05 -07:00
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
2026-04-04 14:15:56 -07:00
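
A sketch of the bounded polling loop (hypothetical helpers; the 1ms temporary
election timeout comes from the commit, while the 50ms poll interval and 30s
overall bound are assumptions); imports: "time":

    func fastResume(setElectionTimeout func(time.Duration), isLeader func() bool, original time.Duration) {
        setElectionTimeout(1 * time.Millisecond)
        defer setElectionTimeout(original) // always restored, even if we give up

        ticker := time.NewTicker(50 * time.Millisecond)
        defer ticker.Stop()
        deadline := time.After(30 * time.Second)
        for {
            select {
            case <-ticker.C:
                if isLeader() {
                    return // leader elected almost immediately with the 1ms timeout
                }
            case <-deadline:
                return // never spin forever; fall back to normal election timing
            }
        }
    }
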
Chris Lu
0da1794856 fix(rust): remove transitive openssl dependency from seaweed-volume
reqwest's default features include native-tls which depends on
openssl-sys, causing builds to fail on musl targets where OpenSSL
headers are not available. Since we already use rustls-tls, disable
default features to eliminate the openssl-sys dependency entirely.
2026-04-04 14:07:01 -07:00
Chris Lu
47baf6c841 fix(docker): add Rust volume server pre-build to latest and dev container workflows
Both container_latest.yml and container_dev.yml use Dockerfile.go_build
which expects weed-volume-prebuilt/ with pre-compiled Rust binaries, but
neither workflow produced them, causing COPY failures during docker build.

Add build-rust-binaries jobs that natively cross-compile for amd64 and
arm64, then download and place the artifacts in the Docker build context.
Also fix the trivy-scan local build path in container_latest.yml.
2026-04-04 13:53:13 -07:00
Chris Lu
d37b592bc4 Update object_store_users_templ.go 2026-04-04 11:52:57 -07:00
Chris Lu
896114d330 fix(admin): fix master leader link showing incorrect port in Admin UI (#8924)
fix(admin): use gRPC address for current server in RaftListClusterServers

The old Raft implementation was returning the HTTP address
(ms.option.Master) for the current server, while peers used gRPC
addresses (peer.ConnectionString). The Admin UI's GetClusterMasters()
converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress
(port - 10000), which produced a negative port (-667) for the current
server since its address was already in HTTP format (port 9333).

Use ToGrpcAddress() for consistency with both HashicorpRaft (which
stores gRPC addresses) and old Raft peers.

Fixes #8921
2026-04-04 11:50:43 -07:00
Chris Lu
f6df7126b6 feat(admin): add profiling options for debugging high memory/CPU usage (#8923)
* feat(admin): add profiling options for debugging high memory/CPU usage

Add -debug, -debug.port, -cpuprofile, and -memprofile flags to the admin
command, matching the profiling support already available in master, volume,
and other server commands. This enables investigation of resource usage
issues like #8919.

* refactor(admin): move profiling flags into AdminOptions struct

Move cpuprofile and memprofile flags from global variables into the
AdminOptions struct and init() function for consistency with other flags.

* fix(debug): bind pprof server to localhost only and document profiling flags

StartDebugServer was binding to all interfaces (0.0.0.0), exposing
runtime profiling data to the network. Restrict to 127.0.0.1 since
this is a development/debugging tool.

Also add a "Debugging and Profiling" section to the admin command's
help text documenting the new flags.
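
The loopback-only debug server is essentially the standard net/http/pprof
pattern; the port handling and function name below are illustrative:

    // Sketch only: pprof handlers bound to 127.0.0.1 so profiling data is
    // not reachable from other hosts.
    import (
        "fmt"
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    )

    func startDebugServer(port int) {
        go func() {
            addr := fmt.Sprintf("127.0.0.1:%d", port)
            if err := http.ListenAndServe(addr, nil); err != nil {
                log.Printf("pprof server stopped: %v", err)
            }
        }()
    }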
2026-04-04 10:05:19 -07:00
Chris Lu
9add18e169 fix(volume-rust): fix volume balance between Go and Rust servers (#8915)
Two bugs prevented reliable volume balancing when a Rust volume server
is the copy target:

1. find_last_append_at_ns returned None for delete tombstones (Size==0
   in dat header), falling back to file mtime truncated to seconds.
   This caused the tail step to re-send needles from the last sub-second
   window. Fix: change the guard from `needle_size <= 0` to `needle_size < 0`,
   since Size==0 delete needles still have a valid timestamp in their tail.

2. VolumeTailReceiver called read_body_v2 on delete needles, which have
   no DataSize/Data/flags — only checksum+timestamp+padding after the
   header. Fix: skip read_body_v2 when size == 0, reject negative sizes.

Also:
- Unify gRPC server bind: use TcpListener::bind before spawn for both
  TLS and non-TLS paths, propagating bind errors at startup.
- Add mixed Go+Rust cluster test harness and integration tests covering
  VolumeCopy in both directions, copy with deletes, and full balance
  move with tail tombstone propagation and source deletion.
- Make FindOrBuildRustBinary configurable for default vs no-default
  features (4-byte vs 5-byte offsets).
2026-04-04 09:13:23 -07:00
Chris Lu
d1823d3784 fix(s3): include static identities in listing operations (#8903)
* fix(s3): include static identities in listing operations

Static identities loaded from -s3.config file were only stored in the
S3 API server's in-memory state. Listing operations (s3.configure shell
command, aws iam list-users) queried the credential manager which only
returned dynamic identities from the backend store.

Register static identities with the credential manager after loading
so they are included in LoadConfiguration and ListUsers results, and
filtered out before SaveConfiguration to avoid persisting them to the
dynamic store.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896

* fix: avoid mutating caller's config and defensive copies

- SaveConfiguration: use shallow struct copy instead of mutating the
  caller's config.Identities field
- SetStaticIdentities: skip nil entries to avoid panics
- GetStaticIdentities: defensively copy PolicyNames slice to avoid
  aliasing the original

* fix: filter nil static identities and sync on config reload

- SetStaticIdentities: filter nil entries from the stored slice (not
  just from staticNames) to prevent panics in LoadConfiguration/ListUsers
- Extract updateCredentialManagerStaticIdentities helper and call it
  from both startup and the grace.OnReload handler so the credential
  manager's static snapshot stays current after config file reloads

* fix: add mutex for static identity fields and fix ListUsers for store callers

- Add sync.RWMutex to protect staticIdentities/staticNames against
  concurrent reads during config reload
- Revert CredentialManager.ListUsers to return only store users, since
  internal callers (e.g. DeletePolicy) look up each user in the store
  and fail on non-existent static entries
- Merge static usernames in the filer gRPC ListUsers handler instead,
  via the new GetStaticUsernames method
- Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was
  failing because DeletePolicy iterated static users that don't exist
  in the store

* fix: show static identities in admin UI and weed shell

The admin UI and weed shell s3.configure command query the filer's
credential manager via gRPC, which is a separate instance from the S3
server's credential manager. Static identities were only registered
on the S3 server's credential manager, so they never appeared in the
filer's responses.

- Add CredentialManager.LoadS3ConfigFile to parse a static S3 config
  file and register its identities
- Add FilerOptions.s3ConfigFile so the filer can load the same static
  config that the S3 server uses
- Wire s3ConfigFile through in weed mini and weed server modes
- Merge static usernames in filer gRPC ListUsers handler
- Add CredentialManager.GetStaticUsernames helper
- Add sync.RWMutex to protect concurrent access to static identity
  fields
- Avoid importing weed/filer from weed/credential (which pulled in
  filer store init() registrations and broke test isolation)
- Add docker/compose/s3_static_users_example.json

* fix(admin): make static users read-only in admin UI

Static users loaded from the -s3.config file should not be editable
or deletable through the admin UI since they are managed via the
config file.

- Add IsStatic field to ObjectStoreUser, set from credential manager
- Hide edit, delete, and access key buttons for static users in the
  users table template
- Show a "static" badge next to static user names
- Return 403 Forbidden from UpdateUser and DeleteUser API handlers
  when the target user is a static identity

* fix(admin): show details for static users

GetObjectStoreUserDetails called credentialManager.GetUser which only
queries the dynamic store. For static users this returned
ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup
fails.

* fix(admin): load static S3 identities in admin server

The admin server has its own credential manager (gRPC store) which is
a separate instance from the S3 server's and filer's. It had no static
identity data, so IsStaticIdentity returned false (edit/delete buttons
shown) and GetStaticIdentity returned nil (details page failed).

Pass the -s3.config file path through to the admin server and call
LoadS3ConfigFile on its credential manager, matching the approach
used for the filer.

* fix: use protobuf is_static field instead of passing config file path

The previous approach passed -s3.config file path to every component
(filer, admin). This is wrong because the admin server should not need
to know about S3 config files.

Instead, add an is_static field to the Identity protobuf message.
The field is set when static identities are serialized (in
GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads
configuration via GetConfiguration automatically sees which identities
are static, without needing the config file.

- Add is_static field (tag 8) to iam_pb.Identity proto message
- Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile
- Admin GetObjectStoreUsers reads identity.IsStatic from proto
- Admin IsStaticUser helper loads config via gRPC to check the flag
- Filer GetUser gRPC handler falls back to GetStaticIdentity
- Remove s3ConfigFile from AdminOptions and NewAdminServer signature
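
In spirit, the credential-manager side of this behaves like the sketch below;
Identity is a simplified stand-in for the iam_pb message, and the method
bodies only illustrate the nil filtering, locking, and username merge
described above:

    // Sketch only: simplified types, not the actual credential package.
    import "sync"

    type Identity struct {
        Name        string
        PolicyNames []string
        IsStatic    bool
    }

    type CredentialManager struct {
        mu               sync.RWMutex
        staticIdentities []*Identity
    }

    // SetStaticIdentities drops nil entries so listing code cannot panic,
    // and marks every stored entry as static.
    func (cm *CredentialManager) SetStaticIdentities(ids []*Identity) {
        kept := make([]*Identity, 0, len(ids))
        for _, id := range ids {
            if id == nil {
                continue
            }
            id.IsStatic = true
            kept = append(kept, id)
        }
        cm.mu.Lock()
        cm.staticIdentities = kept
        cm.mu.Unlock()
    }

    // GetStaticUsernames is what the filer's ListUsers handler merges with
    // the dynamic store's users.
    func (cm *CredentialManager) GetStaticUsernames() []string {
        cm.mu.RLock()
        defer cm.mu.RUnlock()
        names := make([]string, 0, len(cm.staticIdentities))
        for _, id := range cm.staticIdentities {
            names = append(names, id.Name)
        }
        return names
    }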
2026-04-03 20:01:28 -07:00
Chris Lu
0798b274dd feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads

Add a pipe-based prefetch pipeline that overlaps chunk fetching with
response writing during S3 GetObject, SSE downloads, and filer proxy.

While chunk N streams to the HTTP response, fetch goroutines for the
next K chunks establish HTTP connections to volume servers ahead of
time, eliminating the RTT gap between sequential chunk fetches.

Uses io.Pipe for minimal memory overhead (~1MB per download regardless
of chunk size, vs buffering entire chunks). Also increases the
streaming read buffer from 64KB to 256KB to reduce syscall overhead.

Benchmark results (64KB chunks, prefetch=4):
- 0ms latency:  1058 → 2362 MB/s (2.2× faster)
- 5ms latency:  11.0 → 41.7 MB/s (3.8× faster)
- 10ms latency: 5.9  → 23.3 MB/s (4.0× faster)
- 20ms latency: 3.1  → 12.1 MB/s (3.9× faster)
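
A stripped-down sketch of the producer/consumer shape (chunk fetching, SSE,
range handling, and the cache-invalidation retry are omitted; all names are
illustrative, not the actual stream code):

    // Sketch only: bounded chunk prefetch feeding an ordered writer.
    import (
        "context"
        "io"
    )

    type chunkResult struct {
        r    io.ReadCloser
        err  error
        done chan struct{}
    }

    func streamChunks(ctx context.Context, w io.Writer, fileIds []string,
        fetch func(ctx context.Context, fileId string) (io.ReadCloser, error),
        prefetch int) error {

        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // a consumer error stops the producer and in-flight fetches

        results := make(chan *chunkResult, prefetch) // bounds how far fetches run ahead

        // Producer: launch fetch goroutines ahead of the consumer.
        go func() {
            defer close(results)
            for _, fid := range fileIds {
                if ctx.Err() != nil {
                    return
                }
                res := &chunkResult{done: make(chan struct{})}
                go func(fid string, res *chunkResult) {
                    defer close(res.done)
                    res.r, res.err = fetch(ctx, fid)
                }(fid, res)
                select {
                case results <- res:
                case <-ctx.Done():
                    <-res.done // await the launched fetch so nothing leaks
                    if res.r != nil {
                        res.r.Close()
                    }
                    return
                }
            }
        }()

        // Consumer: write chunks in order; on error keep draining so every
        // fetch goroutine is awaited and every reader is closed.
        var firstErr error
        for res := range results {
            <-res.done
            if firstErr == nil && res.err != nil {
                firstErr = res.err
                cancel()
            }
            if firstErr != nil {
                if res.r != nil {
                    res.r.Close()
                }
                continue
            }
            if _, err := io.Copy(w, res.r); err != nil {
                firstErr = err
                cancel()
            }
            res.r.Close()
        }
        return firstErr
    }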

* fix: address review feedback for prefetch pipeline

- Fix data race: use *chunkPipeResult (pointer) on channel to avoid
  copying struct while fetch goroutines write to it. Confirmed clean
  with -race detector.
- Remove concurrent map write: retryWithCacheInvalidation no longer
  updates fileId2Url map. Producer only reads it; consumer never writes.
- Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure.
- Add local cancellable context so consumer errors (client disconnect)
  immediately stop the producer and all in-flight fetch goroutines.

* fix(test): remove dead code and add Range header support in test server

- Remove unused allData variable in makeChunksAndServer
- Add Range header handling to createTestServer for partial chunk
  read coverage (206 Partial Content, 416 Range Not Satisfiable)

* fix: correct retry condition and goroutine leak in prefetch pipeline

- Fix retry condition: use result.fetchErr/result.written instead of
  copied to decide cache-invalidation retry. The old condition wrongly
  triggered retry when the fetch succeeded but the response writer
  failed on the first write (copied==0 despite the fetcher having data).
  Now matches the sequential path (stream.go:197), which checks whether
  the fetcher itself wrote zero bytes.

- Fix goroutine leak: when the producer's send to the results channel
  is interrupted by context cancellation, the fetch goroutine was
  already launched but the result was never sent to the channel. The
  drain loop couldn't handle it. Now waits on result.done before
  returning so every fetch goroutine is properly awaited.
2026-04-03 19:57:30 -07:00
Chris Lu
3efe88c718 feat(s3): store and return checksum headers for additional checksum algorithms (#8914)
* feat(s3): store and return checksum headers for additional checksum algorithms

When clients upload with --checksum-algorithm (SHA256, CRC32, etc.),
SeaweedFS validated the checksum but discarded it. The checksum was
never stored in metadata or returned in PUT/HEAD/GET responses.

Now the checksum is computed alongside MD5 during upload, stored in
entry extended attributes, and returned as the appropriate
x-amz-checksum-* header in all responses.

Fixes #8911

* fix(s3): address review feedback and CI failures for checksum support

- Gate GET/HEAD checksum response headers on x-amz-checksum-mode: ENABLED
  per AWS S3 spec, fixing FlexibleChecksumError on ranged GETs and
  multipart copies
- Verify computed checksum against client-provided header value for
  non-chunked uploads, returning BadDigest on mismatch
- Add nil check for getCheckSumWriter to prevent panic
- Handle comma-separated values in X-Amz-Trailer header
- Use ordered slice instead of map for deterministic checksum header
  selection; extract shared mappings into package-level vars

* fix(s3): skip checksum header for ranged GET responses

The stored checksum covers the full object. Returning it for ranged
(partial) responses causes SDK checksum validation failures because the
SDK validates the header value against the partial content received.

Skip emitting x-amz-checksum-* headers when a Range request header is
present, fixing PyArrow large file read failures.

* fix(s3): reject unsupported checksum algorithm with 400

detectRequestedChecksumAlgorithm now returns an error code when
x-amz-sdk-checksum-algorithm or x-amz-checksum-algorithm contains
an unsupported value, instead of silently ignoring it.

* feat(s3): compute composite checksum for multipart uploads

Store the checksum algorithm during CreateMultipartUpload, then during
CompleteMultipartUpload compute a composite checksum from per-part
checksums following the AWS S3 spec: concatenate raw per-part checksums,
hash with the same algorithm, format as "base64-N" where N is part count.

The composite checksum is persisted on the final object entry and
returned in HEAD/GET responses (gated on x-amz-checksum-mode: ENABLED).

Reuses existing per-part checksum storage from putToFiler and the
getCheckSumWriter/checksumHeaders infrastructure.
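
As a rough illustration of the composite rule (SHA-256 used as the example
algorithm; the helper name is made up):

    // Sketch: composite = base64(hash(concat(raw per-part digests))) + "-" + partCount.
    import (
        "crypto/sha256"
        "encoding/base64"
        "fmt"
    )

    func compositeSHA256(partChecksums [][]byte) (string, error) {
        h := sha256.New()
        for i, digest := range partChecksums {
            if len(digest) == 0 {
                // The upload was initiated with a checksum algorithm, so every
                // part must have recorded one.
                return "", fmt.Errorf("part %d is missing its checksum", i+1)
            }
            h.Write(digest) // concatenate the raw (decoded) per-part digests
        }
        return fmt.Sprintf("%s-%d",
            base64.StdEncoding.EncodeToString(h.Sum(nil)), len(partChecksums)), nil
    }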

* fix(s3): validate checksum algorithm in CreateMultipartUpload, error on missing part checksums

- Move detectRequestedChecksumAlgorithm call before mkdir callback so
  an unsupported algorithm returns 400 before the upload is created
- Change computeCompositeChecksum to return an error when a part is
  missing its checksum (the upload was initiated with a checksum
  algorithm, so all parts must have checksums)
- Propagate the error as ErrInvalidPart in CompleteMultipartUpload

* fix(s3): return checksum header in CompleteMultipartUpload response, validate per-part algorithm

- Add ChecksumHeaderName/ChecksumValue fields to CompleteMultipartUploadResult
  and set the x-amz-checksum-* HTTP response header in the handler, matching
  the AWS S3 CompleteMultipartUpload response spec
- Validate that each part's stored checksum algorithm matches the upload's
  expected algorithm before assembling the composite checksum; return an
  error if a part was uploaded with a different algorithm
2026-04-03 18:37:54 -07:00
Chris Lu
36f37b9b6a fix(filer): remove cancellation guard from RollbackTransaction and clean up #8909 (#8916)
* fix(filer): remove cancellation guard from RollbackTransaction and clean up #8909

RollbackTransaction is a cleanup operation that must succeed even when
the context is cancelled — guarding it causes the exact orphaned state
that #8909 was trying to prevent.

Also:
- Use single-evaluation `if err := ctx.Err(); err != nil` pattern
  instead of double-calling ctx.Err()
- Remove spurious blank lines before guards
- Add context.DeadlineExceeded test coverage
- Simplify tests from ~230 lines to ~130 lines
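
The guard pattern plus the unguarded rollback look roughly like this; Store
and Tx are stand-ins for the filer store types, and the call sequence is only
illustrative:

    // Sketch only: evaluate ctx.Err() once; rollback is cleanup and is not
    // skipped just because the request context is already cancelled.
    import "context"

    type Tx interface{}

    type Store interface {
        BeginTransaction(ctx context.Context) (Tx, error)
        CommitTransaction(ctx context.Context, tx Tx) error
        RollbackTransaction(ctx context.Context, tx Tx) error
    }

    func applyWithRollback(ctx context.Context, store Store) error {
        tx, err := store.BeginTransaction(ctx)
        if err != nil {
            return err
        }
        // Single evaluation: the same err value is used for the check and the return.
        if err := ctx.Err(); err != nil {
            _ = store.RollbackTransaction(ctx, tx) // must still run with a cancelled ctx
            return err
        }
        return store.CommitTransaction(ctx, tx)
    }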

* fix(filer): call cancel() in expiredCtx and test rollback with expired context

- Call cancel() instead of suppressing it to avoid leaking timer resources
- Test RollbackTransaction with both cancelled and expired contexts
2026-04-03 17:55:27 -07:00
os-pradipbabar
d5128f00f1 fix: Prevent orphaned metadata from cancelled S3 operations (Issue #8908) (#8909)
fix(filer): check if context was already cancelled before ignoring cancellation
2026-04-03 16:22:46 -07:00