14298 Commits

Author SHA1 Message Date
github-actions[bot] d0b90d29eb 4.36 4.36 2026-06-25 05:09:40 +00:00
Chris Lu d65ed3b557 add release version-bump workflow 2026-06-24 22:08:06 -07:00
Chris Lu 3b9e196e5f sts: enforce session-policy explicit deny during role chaining (#10103)
* sts: enforce session-policy explicit deny during role chaining

A chained AssumeRole caller authenticates with an STS session token whose
inline session policy can explicitly deny sts:AssumeRole. The deny check only
evaluated the caller's named policies, so such a session could still chain into
any role its trust policy admits. Validate the session token in the deny check
and honor an explicit Deny in the inline session policy too.

* test(sts): integration coverage for AssumeRole authorization

Add an end-to-end AssumeRole authorization test (real weed mini + boto3):
a non-admin caller assumes a role its trust policy admits, an explicit
identity-side deny is blocked, and a session policy's explicit deny blocks
role chaining.

* sts: skip OIDC tokens and reject revoked sessions in the chaining deny check

Review follow-ups on the session-policy deny check:
- Guard session validation with !isOIDCToken so a bearer token our STS service
  cannot validate does not error into a false deny.
- Reject a revoked session before evaluating its policy, restoring the
  revocation enforcement the AssumeRole path lost when it stopped routing
  through IsActionAllowed.
2026-06-24 21:38:21 -07:00
Chris Lu 88a4a939aa fix(sts): authorize AssumeRole by the role's trust policy (#10097)
* fix(sts): authorize AssumeRole by the role's trust policy

The role's trust policy already declares who may assume it, but the caller
also had to pass an identity-side sts:AssumeRole check that only the Admin
action could satisfy — legacy static identities have no way to express
sts:AssumeRole on a role. So assuming any role required a full admin
identity. Drop the redundant check and let the trust policy be the authority;
scope it to specific principals to restrict who can assume.

* sts: resolve caller principal ARN for the trust-policy check

A legacy static identity can reach AssumeRole without a PrincipalArn set;
passing the empty value would miss a trust policy that names a concrete
principal. Resolve it to the canonical user ARN, sharing the logic
GetCallerIdentity already used inline.

* sts: enforce explicit identity-side deny for AssumeRole

Authorizing a named role by its trust policy alone dropped identity-side
evaluation entirely, so a caller whose attached policy explicitly denies
sts:AssumeRole could still assume any role the trust policy admits. Re-check
the caller's policies through the IAM manager for an explicit deny
(deny-always-wins) without requiring an allow; the trust policy stays the
allow authority.
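
The deny-always-wins rule above can be sketched as follows. This is a minimal illustration, not the SeaweedFS code: the `effect` type and `authorizeAssumeRole` are hypothetical names, and the identity-side policy evaluation is reduced to a single precomputed result.

```go
package main

import "fmt"

// effect is a hypothetical summary of evaluating the caller's attached
// policies for sts:AssumeRole on the target role.
type effect int

const (
	noMatch effect = iota // no statement matched
	allow                 // a statement explicitly allows
	deny                  // a statement explicitly denies
)

// authorizeAssumeRole lets the trust policy be the allow authority while an
// explicit identity-side deny always wins. An identity-side allow is not
// required.
func authorizeAssumeRole(trustPolicyAllows bool, identityEffect effect) bool {
	if identityEffect == deny {
		return false
	}
	return trustPolicyAllows
}

func main() {
	fmt.Println(authorizeAssumeRole(true, noMatch)) // trust policy admits, no identity statement
	fmt.Println(authorizeAssumeRole(true, deny))    // explicit deny blocks
	fmt.Println(authorizeAssumeRole(false, allow))  // identity allow cannot replace the trust policy
}
```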
2026-06-24 20:14:26 -07:00
sshhan a1fff50935 fix(postgres): prevent uint32 underflow & OOM in message parsing (#10099)
* fix(postgres): prevent uint32 underflow & OOM in message parsing

* postgres: drop redundant startup guard, use maxStartupMessageSize const

The msgTotalLen < 8 check already guarantees msgLength >= 4, so the extra
msgLength < 4 guard before reading the protocol version was unreachable.
Point the startup size limit at maxStartupMessageSize instead of a literal.

* postgres: trim query terminator safely, cap pre-auth payloads

Use strings.TrimSuffix for the simple-query null terminator so a
non-null-terminated body isn't silently shortened, matching the auth
handlers. Bound password/MD5 reads with a dedicated maxAuthMessageSize
(10 KiB) instead of the 100 MiB maxMessageSize, since these payloads are
read before authentication.

---------

Co-authored-by: shangshuhan <shangshuhan@cmict.chinamobile.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-24 20:05:43 -07:00
Chris Lu 0f1ec8983d mount: don't fail close() on a benign FUSE interrupt (#10102)
A FUSE interrupt is not a process kill. Go's async preemption (SIGURG)
makes a close() under load emit an interrupt on nearly every flush, so
deriving the metadata-flush context from the FUSE cancel channel turned
healthy concurrent close()s into EIO: the interrupt cancelled the
in-flight CreateEntry, which surfaced as "input/output error".

Bound the flush with a deadline instead. A healthy CreateEntry finishes
in well under a second, so the deadline only fires against a genuinely
stuck filer -- still keeping close() from hanging forever -- while
benign preemption no longer aborts a good flush.
2026-06-24 19:54:03 -07:00
Chris Lu 95427b5573 security: add BearerPrefix constant for Authorization headers (#10101)
Introduce security.BearerPrefix ("Bearer ", RFC 6750) and use it
everywhere an "Authorization: Bearer <token>" header is constructed,
replacing the scattered "BEARER "/"Bearer " string literals. SeaweedFS
matches the scheme case-insensitively when parsing (security.GetJwt), so
behavior is unchanged; this removes the magic string and settles the
casing on the standard form. The parser's upper-case comparison stays as
is on purpose.
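
As an illustration of why the casing change is behavior-neutral, here is a small sketch. `BearerPrefix` mirrors the constant named above; `bearerToken` is a hypothetical parser written for this example (it uses `strings.EqualFold`, whereas the commit says the real parser compares in upper case), not `security.GetJwt` itself.

```go
package main

import (
	"fmt"
	"strings"
)

// BearerPrefix is the standard-cased scheme prefix from RFC 6750.
const BearerPrefix = "Bearer "

// bearerToken is a hypothetical parser that matches the scheme
// case-insensitively and returns the token that follows it.
func bearerToken(header string) (string, bool) {
	if len(header) < len(BearerPrefix) {
		return "", false
	}
	if !strings.EqualFold(header[:len(BearerPrefix)], BearerPrefix) {
		return "", false
	}
	return header[len(BearerPrefix):], true
}

func main() {
	// Constructing a header uses the constant instead of a literal.
	header := BearerPrefix + "abc.def.ghi"
	fmt.Println(header)

	// Parsing accepts both the old and the new casing.
	fmt.Println(bearerToken("BEARER abc.def.ghi"))
	fmt.Println(bearerToken(header))
}
```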
2026-06-24 19:36:42 -07:00
Chris Lu 4d3e5d94a9 filer: mint volume read JWT when proxying chunk reads (#10100)
The /?proxyChunkId= endpoint forwards the caller's headers to the volume
server but never mints a read token, so proxied chunk reads return 401
once jwt.signing.read.key is configured. Generate a fileId-scoped volume
token the same way the direct filer read path does, which fixes
filer.sync, filer.backup, filerProxy mounts, the MQ broker and the upload
gateway in one place.
2026-06-24 19:21:57 -07:00
dependabot[bot] 7c9f61d4dc build(deps): bump com.fasterxml.jackson.core:jackson-databind from 2.18.6 to 2.22.0 in /test/java/spark (#10094)
* build(deps): bump com.fasterxml.jackson.core:jackson-databind

Bumps [com.fasterxml.jackson.core:jackson-databind](https://github.com/FasterXML/jackson) from 2.18.6 to 2.22.0.
- [Commits](https://github.com/FasterXML/jackson/commits)

---
updated-dependencies:
- dependency-name: com.fasterxml.jackson.core:jackson-databind
  dependency-version: 2.22.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* build(deps): pin jackson-annotations to its own 2.22 version

jackson-annotations dropped the patch digit in 2.20 and releases on its
own line, so 2.22.0 does not exist. Sharing jackson.version broke
dependency resolution; give it a dedicated property.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-24 19:12:48 -07:00
Chris Lu 96d2d13efe s3: replicate by fanning out from the gateway to every holder (#10078)
* s3: replicate by fanning out from the gateway to every holder

The S3 gateway uploaded each chunk to one volume server, which then
relayed the copies to the other replica holders. The gateway now uploads
each chunk to every holder in parallel (type=replicate), removing the
primary volume server's receive-then-resend relay.

AssignVolume returns every replica holder (new repeated Location replicas,
forwarded from the master assign), the s3api captures them, and the
chunked uploader fans out whenever a chunk has more than one holder.
Cipher uploads keep the server-driven path since per-call encryption would
diverge the replicas.

* s3: cancel sibling replica uploads on the first failure

* s3: trim replica fan-out comments

* s3: roll back successful fan-out chunk copies when a holder fails

A failed fan-out records no FileChunk, so copies that landed on the holders
that finished before the cancel were leaked as orphans the caller could not
see. Track the holders that succeeded and delete the needle from each
(type=replicate, local-only) on failure, leaving nothing behind.
2026-06-24 16:31:58 -07:00
os-pradipbabar d1b1338558 Fix stale cache fallback for empty volume locations in wdclient (#10081)
fix(wdclient): prevent stale cache fallback for empty volume locations

## Problem
During Kubernetes pod restarts, volume servers temporarily disconnect and their
locations are removed from vidMap. The deleteLocation function leaves an empty
array [] in vid2Locations map instead of removing the key entirely.

GetLocations() was checking 'if found && len(locations) > 0', which would fail
for empty arrays and fall back to the cache chain, returning STALE locations
from before the restart. This caused the S3 gateway to try connecting to old pod
IPs that no longer exist, resulting in connection timeouts and hanging registry
sync jobs.

Example timeline:
1. Volume pod at 10.131.1.28:8081 registers volumes 10,12
2. S3 gateway caches: vid2Locations[10] = [10.131.1.28:8081]
3. Pod restarts, gets new IP 10.131.1.65:8081
4. Master sends delete → vid2Locations[10] = [] (empty, but key exists)
5. BUG: GetLocations(10) sees found=true, len=0 → falls back to cache
6. Returns stale 10.131.1.28:8081 instead of waiting for new location
7. S3 requests time out trying to reach the unreachable old IP

## Solution
Distinguish between two cases:
- found=true, locations=[] : Volume explicitly has no locations (e.g. restart)
  → Return nil, false (no fallback to cache)
- found=false : Volume never seen in current map
  → Check cache (preserve cache benefits for unknown volumes)

An empty array explicitly means 'this volume currently has no locations',
which is semantically different from 'volume unknown'. Don't fall back to
stale cache for explicitly empty volumes.
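
The two cases above can be sketched as follows. This is a simplified model, not the wdclient code: the `vidMap` struct here holds plain strings and a single `cache` pointer, both assumptions made for illustration.

```go
package main

import "fmt"

// vidMap is a simplified stand-in: a current map of volume id to locations
// plus an older snapshot used as a fallback cache.
type vidMap struct {
	vid2Locations map[uint32][]string
	cache         *vidMap
}

// GetLocations distinguishes "known but currently empty" from "unknown".
func (m *vidMap) GetLocations(vid uint32) ([]string, bool) {
	locations, found := m.vid2Locations[vid]
	if found {
		if len(locations) == 0 {
			// Explicitly empty: do not fall back to stale cache entries.
			return nil, false
		}
		return locations, true
	}
	// Never seen in the current map: the cache may still help.
	if m.cache != nil {
		return m.cache.GetLocations(vid)
	}
	return nil, false
}

func main() {
	old := &vidMap{vid2Locations: map[uint32][]string{
		10: {"10.131.1.28:8081"},
		12: {"10.131.1.28:8081"},
	}}
	current := &vidMap{
		vid2Locations: map[uint32][]string{10: {}}, // master sent a delete for volume 10
		cache:         old,
	}
	fmt.Println(current.GetLocations(10)) // no stale fallback
	fmt.Println(current.GetLocations(12)) // unknown volume still served from cache
}
```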

## Testing
Added comprehensive tests:
- TestGetLocationsEmptyArrayNoFallback: Verifies empty arrays don't use cache
- TestGetLocationsUnknownVolumeUsesCache: Verifies unknown volumes still use cache
- All existing tests pass

## Impact
Fixes registry sync job hangs during SeaweedFS upgrades/restarts. S3 gateway
will now correctly wait for updated volume locations instead of using stale
cached IPs.

Related: OutSystems.SeaWeedfs Helm chart, vega cluster incident 2026-06-24
2026-06-24 16:31:32 -07:00
Chris Lu 089acfbf36 fix(s3api): apply static config file updates on reload (#10096)
A config-file reload (SIGHUP) routed through MergeS3ApiConfiguration,
which skips identities marked static so dynamic admin/filer updates can't
clobber them. That also blocked the config file itself from updating its
own identities, so editing a secretKey and reloading had no effect.

Thread a fromStaticFile flag from the file-load path into the merge: the
authoritative file overwrites its static identities (and reapplies service
accounts under them), while dynamic updates still leave them immutable.
Mark the rebuilt identities static in the merge so a concurrent
RemoveIdentity never observes them as removable mid-reload.
2026-06-24 16:26:35 -07:00
Chris Lu cd828f6503 s3: propagate IAM changes from standalone weed s3 to peer pods (#10095)
Standalone weed s3 created a master client and registered the receiving
SeaweedS3IamCache gRPC service, but never wrapped its credential store
with the propagating store. Only the filer-embedded path called
SetMasterClient, so IAM mutations on one s3 pod never reached peers; they
served a stale in-memory identity cache and returned InvalidAccessKeyId
until restarted.

Wrap the credential store with the master client when one is available,
mirroring the filer path, so mutations fan out over the existing gRPC
cache service.
2026-06-24 16:26:08 -07:00
Chris Lu c15989387b s3tables: allow hyphens in namespace and table names (#10093)
* s3tables: allow hyphens in namespace and table names

Iceberg REST clients routinely use hyphenated namespace/table names, but the
S3 Tables charset (a-z, 0-9, _) rejected them with 400. Accept '-' as an
interior character (names must still start, and namespaces end, with a letter
or digit), making the catalog conformant for those clients. A permissive
superset of the AWS S3 Tables charset.
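
One way to read the rule above as patterns, offered as a sketch only: the regular expressions and function names below are assumptions, not the validator in the s3tables package, and any length limits are omitted.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Table names: start with a letter or digit, then a-z, 0-9, '_' or '-'.
	tableNameRe = regexp.MustCompile(`^[a-z0-9][a-z0-9_-]*$`)
	// Namespaces: additionally must end with a letter or digit.
	namespaceRe = regexp.MustCompile(`^[a-z0-9]([a-z0-9_-]*[a-z0-9])?$`)
)

func validTableName(name string) bool { return tableNameRe.MatchString(name) }
func validNamespace(name string) bool { return namespaceRe.MatchString(name) }

func main() {
	fmt.Println(validTableName("daily-orders")) // hyphen as interior character
	fmt.Println(validTableName("-orders"))      // leading hyphen is still invalid
	fmt.Println(validNamespace("sales-2026"))
	fmt.Println(validNamespace("sales-")) // namespaces must end with a letter or digit
}
```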

* s3tables: allow hyphens in table ARN parsing too

The ARN regexes still excluded '-', so parseTableFromARN rejected ARNs with
hyphenated namespace/table names and existing reject-the-hyphen tests broke.
Widen the ARN patterns to match the validator, retarget those tests at a
still-invalid leading-hyphen name, and cover ARN parsing with hyphens.
2026-06-24 16:24:45 -07:00
Chris Lu 1c5f8244a4 s3tables: fix create-after-rename overwriting the renamed table (#10091)
* s3tables: purge decoupled table data without deleting the reused name path

A renamed or created-over-leftover table keeps its data at a location that
differs from its catalog name path. Drop now purges that data location and
clears the marker, instead of recursively deleting the name path, which may
still hold another table's data.

* iceberg: route a table created over a leftover to a unique location

When the default location is occupied by a leftover directory (data kept when
another table was renamed to this name), create the new table at a unique
location so it cannot overwrite that table's metadata. Common case is unchanged.

* iceberg: fail table create when the leftover-path check errors

A transient filer lookup error fell through as "not occupied", routing the
new table back to the default path and risking the very overwrite this check
guards against. Propagate the error and return 500 instead.

* s3tables: assert all catalog xattrs cleared on decoupled drop

Seed the full marker set so the test catches a regression that leaves the
policy, tags, version, or entry-type attribute on the reused name path.

* s3tables: refuse to drop a table whose data path is an ancestor

Corrupt metadata can resolve the data path to the bucket or namespace root,
which the bucket-scope check still admits; a recursive purge there would wipe
sibling tables. Reject an ancestor data path before deleting.
2026-06-24 14:37:04 -07:00
Chris Lu 5456f9d695 mount: confirm an empty directory rebuild before caching it (#10092)
A directory rebuild wiped the cached children, listed the filer once, and
published the directory authoritatively cached over whatever came back. A
transient empty listing -- a momentary list-stream glitch that ends as a
clean EOF with no entries -- then stranded a populated directory cached
over an empty store, hiding every file in it until some unrelated event
happened to rebuild it: stat returns ENOENT and readdir returns nothing
though the files are safe on the filer, and nothing re-triggers a build.

Re-read the directory when the listing comes back empty before trusting
it. The first re-read is immediate, since the likely transient clears on a
fresh stream; later attempts space out. A genuinely empty directory still
lists empty every time and caches as before, so only empty listings pay
the extra read.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 14:25:23 -07:00
Chris Lu 5112da98a2 mount: skip redundant permission checks under default_permissions (#10089)
With default_permissions (the mount default) the kernel enforces unix
permission bits from the getattr/lookup attributes before it ever calls
Open, Create, or Mknod. The mount was re-checking permissions in
AcquireHandle and createRegularFile anyway, which duplicated the kernel's
work and kept the supplementary-group lookup on the per-file hot path.

Gate only the mode-bit access check on default_permissions being off, so
a non-root copy does no permission work on open/create. createRegularFile
still loads the parent to validate it exists, since the create RPC skips
the filer-side parent check. With default_permissions off the mount
remains the sole enforcer, so the full check still runs.
2026-06-24 14:24:51 -07:00
Chris Lu ef109fe9e1 mount: don't hang close() when a writer is killed during flush (#10090)
* operation: bound AssignVolume with a deadline

AssignVolume ran on context.Background(), so when the filer is overwhelmed
the RPC could block indefinitely and wedge every caller holding the
connection. Give it a 30s deadline so a stuck assign fails and the caller's
retry/error path runs instead of hanging forever.

* mount: abort flush when the FUSE request is interrupted

On close(), a killed process blocks in fuse_flush waiting for the mount to
answer. doFlush ran its metadata CreateEntry on context.Background() and
ignored the kernel interrupt channel, so against an overwhelmed filer the
flush never completed and the process stayed in uninterruptible sleep --
making the pod un-killable.

Derive a context from the FUSE cancel channel in Flush/Fsync and thread it
through doFlush -> flushMetadataToFiler -> streamCreateEntry; the retry loop
stops as soon as the context is cancelled. Release and the pre-rename flush
keep a non-cancellable context since they must finish regardless.

* operation: harden the AssignVolume timeout test

Make the test double's signal send non-blocking and bound the receive with a
timeout so a regression can't wedge the test instead of failing it.
2026-06-24 14:24:22 -07:00
Jaehoon Kim a11d81b21f fix(filer.backup): repair chunk-incomplete and stale destination entries (#10082)
* fix(filer.backup): repair chunk-incomplete and stale destination entries

filer.backup left destinations diverged while metadata advanced — chunk-incomplete
(missing/gapped ranges at full attr.file_size) or holding a chunk superseded by a
missed overwrite. The skip/repair decision keyed on filer.FileSize (the attr),
which a truncated entry keeps full, so it never repaired.

Decide from actual chunk state instead:
- coversReference: range-by-range containment (scalar byte totals and attr
  FileSize/Md5 cannot see chunk-level gaps).
- hasStaleBackupChunk: a backup-written chunk (SourceFileId) the source no longer
  lists; ignores out-of-band (rsync/direct) chunks.
- destinationMatchesReference: allocation-free positional fast path gating the
  above so they run only on divergence (the in-sync path stays cheap).
- A strictly-newer destination is never repaired, so an older out-of-order replay
  cannot roll it back. The stale signal is deferred at equal mtime (same-second
  versions cannot be ordered; reliable S3 sub-second ordering is a separate fix).

Tests in filer_sink_test.go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* filer.backup: verify chunk range in destinationMatchesReference fast path

The allocation-free fast path matched a destination chunk to its reference
by SourceFileId alone. That is correct today only because replicateOneChunk
copies the source chunk's Offset/Size verbatim, so SourceFileId identity
implies an identical range — an invariant that lives in another file with no
guard linking the two. If replication ever re-chunks (split/coalesce), a
chunk with the right SourceFileId but a different range would fast-path as a
full match and skip a needed repair (a false positive in the very class this
change otherwise prevents).

Compare Offset/Size alongside SourceFileId so the fast path is self-contained
and can only be more conservative (a range mismatch falls through to the
precise coversReference/hasStaleBackupChunk checks). Add tests for a shifted
offset and a larger size at matching identity.
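
A minimal sketch of the positional fast path with the range comparison added. The field names `SourceFileId`, `Offset`, and `Size` come from the description above; the `chunk` struct and the function signature are assumptions for illustration.

```go
package main

import "fmt"

// chunk is a reduced stand-in for a file chunk record.
type chunk struct {
	SourceFileId string
	Offset       int64
	Size         uint64
}

// destinationMatchesReference compares chunks position by position without
// allocating. Any mismatch returns false so the caller falls through to the
// precise (and more expensive) checks.
func destinationMatchesReference(dst, ref []chunk) bool {
	if len(dst) != len(ref) {
		return false
	}
	for i := range ref {
		if dst[i].SourceFileId != ref[i].SourceFileId ||
			dst[i].Offset != ref[i].Offset ||
			dst[i].Size != ref[i].Size {
			return false
		}
	}
	return true
}

func main() {
	ref := []chunk{{"3,01", 0, 4}, {"3,02", 4, 4}}
	same := []chunk{{"3,01", 0, 4}, {"3,02", 4, 4}}
	shifted := []chunk{{"3,01", 0, 4}, {"3,02", 8, 4}} // same identity, different offset

	fmt.Println(destinationMatchesReference(same, ref))
	fmt.Println(destinationMatchesReference(shifted, ref))
}
```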

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 14:23:38 -07:00
jk2lx e1f89f85f2 fix(filer): apply -filer.disk default to metadata log assigns (#10080)
* fix(filer): apply -filer.disk default to metadata log assigns

Metadata event log writes call operation.Assign directly and used only
FilerConf path rule DiskType. When filer.conf rules were missing or
unmatched, the master received an empty DiskType and grew volumes on the
built-in hdd layout.

Mirror resolveAssignStorageOption: wire FilerOption.DiskType into the
Filer, fall back when the matched path rule has no disk type, and return
the matched rule from resolveMetadataLogAssignDiskType to avoid duplicate
MatchStorageRule lookups.

Co-authored-by: Cursor <cursoragent@cursor.com>

* mini: fall back to -volume.disk for filer default disk type

weed server copies -volume.disk into the filer disk default when
-filer.disk is unset; weed mini did not, so metadata-log assigns sent
an empty disk type on clusters that only tag volumes (e.g. hot/warm).

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-24 10:47:11 -07:00
Chris Lu d29e6ed98a deps: replace deleted tyler-smith/go-bip39 with cosmos fork (#10088)
The tyler-smith/go-bip39 repository was deleted from GitHub, so go mod
download fails for anyone resolving it directly (GOPROXY=direct). It
only reaches us transitively through rclone's internxt backend, which
calls IsMnemonicValid and NewSeed. Point it at cosmos/go-bip39, an
API-compatible and maintained fork.
2026-06-24 10:41:43 -07:00
Chris Lu e744b5f2ee iceberg: detect table-exists through the wrapped manager error (#10075)
handleCreateTable used a type assertion that fails through WithFilerClient's
'all filers failed' wrap, so a concurrent create that the pre-check missed
fell through instead of returning the existing table. Use errors.As.
2026-06-24 10:22:36 -07:00
patrick 3e2c637858 util: trim minFreeSpace values before parsing (#10083) 2026-06-24 09:03:38 -07:00
Lisandro Pin 30f2dd5040 Weed shell ec.rebuild: Allow targeting rebuild to specific volume IDs. (#10087) 2026-06-24 08:40:29 -07:00
qzhello fb168e2a36 fix: avoid reading upload body when writing JSON errors (#10073)
* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.

Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers

* fix: avoid reading upload body when writing JSON errors
2026-06-23 20:20:11 -07:00
Chris Lu c95401b11a iceberg: support table rename (#10068)
* s3tables: add RenameTable operation

* iceberg: support table rename

* iceberg: test table rename

* s3tables: keep table data in place on rename

Rename is catalog-only: drop the source's catalog xattrs in place instead of recursively deleting its directory, which wiped the metadata.json and data files the renamed destination still points at. Treat a missing table-metadata xattr as NoSuchTable in GetTable so the soft-deleted source name stops resolving.

* s3tables: test rename preserves data

Make the in-memory filer honor recursive data deletion and seed the source table's metadata/ and data/ children, then assert a rename leaves them intact, the source name resolves to NoSuchTable, and the destination resolves to the preserved location.

* iceberg: map rename errors through wrapped manager error

* s3tables: authorize rename destination namespace

Rename moved a table into the destination namespace after only checking the source, letting a source-authorized caller place tables in namespaces they don't control. Require CreateTable on the destination namespace and bucket before writing.

* s3tables: purge renamed table data on drop

* s3tables: test table data dir derivation
2026-06-23 20:18:11 -07:00
Chris Lu 7abed4e517 s3: skip 503 when client disconnects during remote cache wait (#10071)
s3: don't write 503 to a disconnected client during remote cache wait

When the remote-only cache poll returns without chunks, re-check the
request context before emitting 503 + Retry-After. A client that
disconnected during the wait surfaces as context.Canceled, which the
caller already handles silently; writing to the closed connection only
produced broken-pipe log noise.
2026-06-23 15:31:08 -07:00
Chris Lu 0403e47ef6 iceberg: support views (#10069)
* s3tables: tag table entries and exclude views from table listings

* s3tables: add view CRUD operations

* iceberg: support view create, load, exists, drop, and list

* iceberg: support view update

* iceberg: test view error classification and metadata round-trip

* iceberg: pre-check existence and write view metadata only after create

* iceberg: map view namespace-not-found to 404

* iceberg: test view create namespace-404 and duplicate no-clobber

* s3tables: tag view metadata and entry type atomically

CreateView wrote ExtendedKeyMetadata and ExtendedKeyEntryType in two
UpdateEntry calls, so a partial failure could leave a view directory
untagged. Add setExtendedAttributes to set both in one UpdateEntry.

* iceberg: roll back view registration when metadata write fails

The metadata file is written after the catalog registers the view. If
that write fails, drop the just-created view so it doesn't linger
pointing at a missing metadata.json. Reuse the DeleteView path via a
shared dropView helper.
2026-06-23 15:22:31 -07:00
Chris Lu 1ca628d3e9 iceberg: support multi-table transaction commit (#10066)
* iceberg: support multi-table transaction commit

Add handleCommitTransaction for POST /v1/transactions/commit. Validation
is atomic across all table-changes (resolve, load, evaluate every
requirement before any write); metadata writes and pointer flips are
best-effort with rollback, so this is not crash-atomic.

* iceberg: route transactions/commit with and without prefix

* iceberg: test transaction commit request decoding

* iceberg: restore full prior table state on transaction rollback

* iceberg: test transaction rollback restores full prior table state

* iceberg: only clean up metadata for rolled-back tables
2026-06-23 14:08:03 -07:00
Chris Lu 628ce57625 iceberg: support table register (#10067)
* s3tables: add RegisterTable op

* iceberg: support table register

* iceberg: test register table

* iceberg: parse engine-written metadata version from location

* iceberg: test metadata version parsing for both filename forms

* iceberg: map register errors through wrapped manager error

* iceberg: validate register metadata-location bucket and reject traversal

* iceberg: log register metadata load failure
2026-06-23 14:07:13 -07:00
Chris Lu 63f2f0bef5 s3: keep a file promoted to a directory retrievable as an object (#10070)
* filer: treat a directory carrying object data as an S3 key object

A file promoted to a directory by a child write keeps its chunks, inline
content, or remote-tiered entry. Recognize that as a directory key object,
not only when a Mime is set, so the object still lists, demotes on delete,
and is not reclaimed by cleanup like the object it still is.

* filer: keep the empty-folder cleaner from reclaiming a promoted object

The cleaner skips directory key objects, but its check only looked at the
Mime. Mirror the chunks/content/remote check so a file promoted to a
directory is not deleted once its children are gone.

* s3: serve ranged GET for a directory that carries object data

Reject only zero-size directories so a file promoted to a directory streams
range requests instead of returning 404, while empty directories still 404.

* s3: return HEAD metadata for a directory that carries object data

HEAD now 404s a directory only when it has no data, so a promoted object is
retrievable while empty/implicit directories still fall back to LIST.
2026-06-23 14:06:00 -07:00
7y-9 ddd11e44f9 feat: support marking volumes by collection (#9585)
* feat: add collection.mark shell command

Add collection.mark to mark all existing normal volume replicas in a collection as readonly or writable. The command runs in preview mode by default and requires -apply to execute changes. It reuses existing volume mark RPCs, supports default collection aliases, skips EC shards, and adds unit tests for option parsing and target collection logic.

* Revert "feat: add collection.mark shell command"

This reverts commit 50c2bbf94c.

* feat: support marking volumes by collection

Add a -collection option to volume.mark so operators can mark every normal volume replica in a collection using existing topology data and volume mark RPCs.

The change keeps the single-volume path unchanged and adds tests for collection target selection, EC shard exclusion, and argument validation.

Co-authored-by: Codex <noreply@openai.com>

* volume.mark: reuse eachDataNode for collection traversal

* volume.mark: continue past per-volume failures and report progress

Collection marking aborted on the first failed RPC, leaving the
collection half-marked with no record of which volumes succeeded.
Mark every reachable volume, print per-volume progress to the writer,
and return an aggregated error naming the failures.

* volume.mark: let -collection _default target the unnamed collection

Other volume commands use the _default sentinel to match volumes with
no named collection; volume.mark could not reach them at all. Map
_default to the empty collection name in the filter.

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 11:27:43 -07:00
msementsov 70d9dd5afe volume.balance: add -volumesPerExec to cap moves per run
Limit the number of volume moves performed in one command execution; re-run to continue. 0 = unlimited.
2026-06-23 10:48:33 -07:00
198wmj aeaf62fa86 fix: resolve postgres startup message length type mismatch and uint underflow OOM risk (#10065)
* fix: resolve postgres startup message length type mismatch and uint underflow OOM risk

* Update weed/server/postgres/server.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: wangmeijuan <542204218@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-23 10:08:26 -07:00
Chris Lu b0bad761ff worker: don't leak task goroutines on forced shutdown (#10062)
* worker: don't leak task goroutines on forced shutdown

Stop() drains in-flight tasks for 30s, then terminates the manager loop.
A task still running past that deadline later reports completion through
w.cmds - getAdmin in completeTask, the ActionIncTask* send, removeTask -
but with the loop gone and cmds unbuffered, those sends and the response
reads behind getAdmin/getTaskLoad block forever, leaking the goroutine.

Close a done channel when the loop exits and route the task-goroutine
sends through it so they abort and return zero values instead of
blocking. getAdmin can now return nil mid-shutdown, so collapse its
double-call sites to a single nil-checked call to avoid a deref.

* worker: abort remaining manager-loop sends after shutdown

Extend the post-shutdown abort to the sends that still blocked: Stop()'s
own ActionStop (so a second Stop, e.g. an admin-shutdown timer racing an
explicit one, doesn't hang), setTask, and handleTaskCancellation. Route
them through w.done so they return instead of blocking when the loop is
gone. Stop is now idempotent.
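
The done-channel pattern described above can be sketched as follows. The `worker` type here is a toy model with string commands; `cmds` and `done` follow the names in the message, everything else is an assumption.

```go
package main

import "fmt"

// worker is a toy model: an unbuffered command channel served by one loop,
// plus a done channel that is closed when the loop exits.
type worker struct {
	cmds chan string
	done chan struct{}
}

func newWorker() *worker {
	return &worker{cmds: make(chan string), done: make(chan struct{})}
}

// loop handles commands until it sees "stop", then closes done on the way out.
func (w *worker) loop() {
	defer close(w.done)
	for cmd := range w.cmds {
		if cmd == "stop" {
			return
		}
	}
}

// send delivers cmd to the loop, or gives up once the loop has exited
// instead of blocking forever on the unbuffered channel.
func (w *worker) send(cmd string) bool {
	select {
	case w.cmds <- cmd:
		return true
	case <-w.done:
		return false
	}
}

func main() {
	w := newWorker()
	go w.loop()

	fmt.Println(w.send("task")) // delivered
	fmt.Println(w.send("stop")) // delivered, loop exits
	fmt.Println(w.send("late")) // aborted: late task report does not leak
	fmt.Println(w.send("stop")) // aborted: a second stop does not hang
}
```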
2026-06-23 10:06:59 -07:00
AlexALei faa8c3963b fix(chunk_cache): close data/index files on initialization error (#10057)
* fix(chunk_cache): close data/index files on initialization error

* chunk_cache: assign outer err on the .dat open path

The error-path defer keys off the function-level err, but the .dat
OpenFile used := and shadowed it, so that path relied on nothing being
open yet rather than the cleanup invariant. Assign the outer err so
every error return is uniform.
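
A small sketch of why the shadowed err matters for an error-path defer. The resource type, open, and load are hypothetical, and the ordering is contrived so the difference is observable; it is not the chunk_cache code itself.

```go
package main

import (
	"errors"
	"fmt"
)

type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

var errOpen = errors.New("open failed")

// open is a hypothetical opener that fails on demand.
func open(fail bool) (*resource, error) {
	if fail {
		return nil, errOpen
	}
	return &resource{}, nil
}

// load opens one resource, then fails on a second open. The deferred cleanup
// keys off the function-level err, so it only runs if that variable is set.
func load(shadow bool) (*resource, error) {
	var err error
	first, _ := open(false)
	defer func() {
		if err != nil {
			first.Close()
		}
	}()
	if shadow {
		// := declares a new err scoped to this if statement;
		// the function-level err stays nil and the cleanup is skipped.
		if _, err := open(true); err != nil {
			return first, err
		}
	}
	// = assigns the function-level err, so the deferred cleanup sees it.
	if _, err = open(true); err != nil {
		return first, err
	}
	return first, nil
}

func main() {
	r, _ := load(true)
	fmt.Println(r.closed) // false: shadowed err, cleanup skipped
	r, _ = load(false)
	fmt.Println(r.closed) // true: outer err assigned, cleanup ran
}
```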

* chunk_cache: verify descriptor closure on POSIX, not just Windows

os.Remove succeeds on open files on Linux/macOS, so the removal check
only proved closure on Windows. Compare the open-fd count before and
after the failed load; gate the removal check to Windows.

---------

Co-authored-by: Contributor <contributor@example.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 01:49:35 -07:00
shiftraodd 1e2412e502 fix: enforce XATTR_REPLACE semantics in setxattr (#10059)
* Fix missing XATTR_REPLACE semantics in weedfs_xattr.go

* mount: fix XATTR_CREATE/XATTR_REPLACE flag semantics in setxattr

XATTR_CREATE fell through into the XATTR_REPLACE branch: creating a new
attribute hit the empty-oldData guard and returned ENODATA instead of
creating it, while creating over an existing attribute silently succeeded
without the EEXIST that setxattr(2) requires. Drop the fallthrough chain
so CREATE returns EEXIST when the attribute already exists, REPLACE
returns ENODATA when it is missing, and otherwise the value is written.
Test existence via the map lookup so an attribute with an empty value is
still treated as present.
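
The flag handling described above might look roughly like this. It is a sketch under assumptions: the flag values are the Linux ones, and setXattr, errExist, and errNoData are hypothetical stand-ins for the mount code and its errno results.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	xattrCreate  = 0x1 // XATTR_CREATE on Linux
	xattrReplace = 0x2 // XATTR_REPLACE on Linux
)

var (
	errExist  = errors.New("EEXIST")  // stand-in for the EEXIST errno
	errNoData = errors.New("ENODATA") // stand-in for the ENODATA errno
)

// setXattr applies setxattr(2) flag semantics to an in-memory attribute map.
// Existence is tested with the map lookup, so an empty value counts as present.
func setXattr(attrs map[string][]byte, name string, value []byte, flags int) error {
	_, exists := attrs[name]
	switch {
	case flags&xattrCreate != 0 && exists:
		return errExist
	case flags&xattrReplace != 0 && !exists:
		return errNoData
	}
	attrs[name] = value
	return nil
}

func main() {
	attrs := map[string][]byte{"user.empty": {}}
	fmt.Println(setXattr(attrs, "user.new", []byte("v"), xattrCreate))    // <nil>
	fmt.Println(setXattr(attrs, "user.new", []byte("v"), xattrCreate))    // EEXIST
	fmt.Println(setXattr(attrs, "user.gone", []byte("v"), xattrReplace))  // ENODATA
	fmt.Println(setXattr(attrs, "user.empty", []byte("v"), xattrReplace)) // <nil>
}
```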

---------

Co-authored-by: 王郁文 <wangyuwen@cmict.chinamobile.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 01:31:14 -07:00
patrick 4bcd27fb6f s3api: preserve equals signs in tag values (#10058)
* s3api: preserve equals signs in tag values

* s3api: decode tag key once in parseTagsHeader

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 01:25:32 -07:00
mumingl 16c3f5c816 fix: Resolve inconsistent usage of error variables (#10060)
* fix: Resolve inconsistent usage of error variables

* mysql2: guard nil DB on open failure and wrap connect error

---------

Co-authored-by: muminglei <muminglei@cmict.chinamobile.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 01:25:02 -07:00
Chris Lu 6de283ccaa iceberg: return 400 for invalid namespace/table names (#10051)
* iceberg: return 400 for invalid namespace/table names

The S3 Tables name charset (a-z, 0-9, _) is stricter than the Iceberg
REST spec, so clients sending hyphens or uppercase hit a validation
error. That error fell through to 500; it's client input, so map it to
400 BadRequestException across the namespace and table handlers.

* iceberg: tighten name-validation error matching

Match the validator's own phrasings (invalid/must/cannot) instead of a
bare "namespace name"/"table name" substring, so an unrelated fault that
happens to mention a name isn't misreported as a 400. Lowercase first to
stay robust to message capitalization.
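
One way the tightened matching could be sketched. The error messages, isNameValidationError, and statusFor are hypothetical; only the invalid/must/cannot phrasings and the lowercase-first step come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Hypothetical sample errors; the real validator messages may differ.
var (
	errBadName   = errors.New("Invalid table name: My-Table")
	errUnrelated = errors.New("filer unavailable while reading table name index")
)

// isNameValidationError reports whether err looks like a name-validation
// failure: it must mention a name and use one of the validator's phrasings.
func isNameValidationError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	if !strings.Contains(msg, "namespace name") && !strings.Contains(msg, "table name") {
		return false
	}
	for _, phrase := range []string{"invalid", "must", "cannot"} {
		if strings.Contains(msg, phrase) {
			return true
		}
	}
	return false
}

// statusFor maps client-input errors to 400 and everything else to 500.
func statusFor(err error) int {
	if isNameValidationError(err) {
		return http.StatusBadRequest
	}
	return http.StatusInternalServerError
}

func main() {
	fmt.Println(statusFor(errBadName))   // 400
	fmt.Println(statusFor(errUnrelated)) // 500
}
```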
2026-06-23 00:42:42 -07:00
Chris Lu 0ded0984a4 iceberg: support namespace property updates (#10052)
* iceberg: support namespace property updates

Add POST /v1/namespaces/{namespace}/properties to the REST catalog. It
applies the request's removals and updates and returns the removed/updated/
missing summary the spec defines. A new UpdateNamespace op on the S3 Tables
manager rewrites the stored namespace properties; AWS S3 Tables namespaces
have no properties, so this is the SeaweedFS-side backing for the catalog.

* iceberg: dedup namespace property removals

A key repeated in removals was deleted on its first occurrence, then
reported as missing on the next — landing in both removed and missing.
Skip keys already processed.
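
A minimal sketch of the dedup, assuming properties are held in a plain map. applyRemovals is a hypothetical helper, not the catalog handler.

```go
package main

import "fmt"

// applyRemovals deletes each requested key at most once and reports it as
// either removed or missing, never both.
func applyRemovals(props map[string]string, removals []string) (removed, missing []string) {
	seen := make(map[string]struct{}, len(removals))
	for _, key := range removals {
		if _, dup := seen[key]; dup {
			continue // already processed on an earlier occurrence
		}
		seen[key] = struct{}{}
		if _, ok := props[key]; ok {
			delete(props, key)
			removed = append(removed, key)
		} else {
			missing = append(missing, key)
		}
	}
	return removed, missing
}

func main() {
	props := map[string]string{"owner": "a"}
	removed, missing := applyRemovals(props, []string{"owner", "owner", "ttl"})
	fmt.Println(removed, missing) // [owner] [ttl]
}
```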

* iceberg: map namespace-update backend errors to REST statuses

UpdateNamespaceProperties returned 500 for every manager failure, masking
the namespace being dropped between read and write, or a denied caller.
Inspect the typed S3TablesError and answer 404/403 accordingly, 500 only
for the rest. Also replaces the GetNamespace not-found string match.

* iceberg: test the namespace-properties conflict path

Cover the 422 returned when a key appears in both removals and updates.
The check runs before any backend call, so it needs no filer.
2026-06-23 00:41:47 -07:00
7y-9 44d575100a fix(s3api): preserve requested AES256 copy encryption (#10049)
* fix(s3api): preserve requested AES256 copy encryption

Problem
CopyObject metadata processing ignored an explicit x-amz-server-side-encryption: AES256 request header. A destination copy could lose the requested SSE-S3 metadata even though KMS requests were handled.

Root cause
processMetadataBytes only wrote the destination SSE header when the requested algorithm was aws:kms. Any other explicit SSE algorithm fell through to the source-preservation branch.

Fix
Write the requested SSE algorithm whenever x-amz-server-side-encryption is present, and keep KMS-specific metadata handling limited to aws:kms.

Co-authored-by: Codex <noreply@openai.com>

* fix(s3api): reject unsupported copy encryption algorithms

A mistyped or unsupported x-amz-server-side-encryption value on a copy
request slipped past validation and got persisted as the destination's
algorithm header, advertising encryption that was never applied. Reject
anything other than AES256 or aws:kms up front.
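
The up-front check could be sketched like this. validateCopySSE is a hypothetical helper that takes the header value as a string, and treating an absent header as valid is an assumption of the sketch.

```go
package main

import "fmt"

// validateCopySSE accepts an absent header or one of the two supported
// algorithms, and rejects anything else before it can be persisted.
func validateCopySSE(algorithm string) error {
	switch algorithm {
	case "", "AES256", "aws:kms":
		return nil
	default:
		return fmt.Errorf("unsupported server-side encryption algorithm %q", algorithm)
	}
}

func main() {
	fmt.Println(validateCopySSE("AES256"))  // <nil>
	fmt.Println(validateCopySSE("aws:kms")) // <nil>
	fmt.Println(validateCopySSE("AES-256")) // unsupported server-side encryption algorithm "AES-256"
}
```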

* fix(s3api): write SSE key metadata for empty encrypted copies

A zero-byte source copied with an explicit SSE request took the
no-content branch and never ran the encryption path, leaving the object
with a bare algorithm header but no key. HEAD then advertised SSE while
the encryption-state machine saw the header as orphaned. Run the inline
encryption path when the destination requests encryption so the key
metadata is written too.

* s3api: use SSEAlgorithmKMS constant in copy metadata handling

* test(s3api): cover source SSE preservation on copy

* test(iam): allow the local client's real source IP in SourceIp tests

The aws:SourceIp allow policies hardcoded the loopback CIDRs, but a CI
runner reaching the server over localhost can be observed with one of the
host's RFC1918 addresses (the S3 endpoint is advertised on a 10.x
interface), so the positive-condition PutObject was denied and the allow
assertion flaked while the deny path passed trivially. Broaden the allow
list to loopback plus private ranges via a shared helper, and log the
denial on each failed attempt so any residual failure is diagnosable.
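
A rough sketch of a loopback-plus-private allow list, using the standard RFC1918 ranges. allowedSourcePrefixes and sourceAllowed are hypothetical names; the shared test helper itself is not shown in this log.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedSourcePrefixes covers loopback and the RFC1918 private ranges.
var allowedSourcePrefixes = []netip.Prefix{
	netip.MustParsePrefix("127.0.0.0/8"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
}

// sourceAllowed reports whether ip falls inside any allowed prefix.
func sourceAllowed(ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	for _, p := range allowedSourcePrefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(sourceAllowed("127.0.0.1")) // true
	fmt.Println(sourceAllowed("10.1.2.3"))  // true
	fmt.Println(sourceAllowed("8.8.8.8"))   // false
}
```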

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-22 22:19:24 -07:00
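aCuteBegCinner 42ccfc0763 refactor: replace %v with %w in fmt.Errorf to preserve the error chain (#10050)
Replaced the error formatting in multiple files to wrap the original error
with %w, preserving the full error chain and making errors easier to trace when debugging.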

Co-authored-by: guant <guant@chinaunicom.cn>
2026-06-22 21:31:45 -07:00
AlexALei 091d953c34 fix(benchmark): close CPU profile file handle after profiling (#10048)
Co-authored-by: Contributor <contributor@example.com>
2026-06-22 20:33:22 -07:00
patrick 11b7b7247f util: support IPv6 host port parsing (#10046)
* util: support IPv6 host port parsing

* Update weed/util/parse.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-22 20:32:43 -07:00
DanielWu-star 55a54574af fix: use %w instead of %v in fmt.Errorf to preserve error chain (#10047)
In ec_task.go, 23 fmt.Errorf calls used %v verb to wrap errors,
breaking the error chain introduced in Go 1.13. This prevents
callers from using errors.Is() and errors.As() to inspect the
underlying error type.

Changed all fmt.Errorf calls from %v to %w to properly wrap
errors, preserving the error chain for upstream callers.

Note: glog.* logging calls and fmt.Sprintf calls intentionally
keep %v as they are not error wrapping contexts.
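
A short illustration of the difference. errShardMissing and the helper names are hypothetical and are not taken from ec_task.go.

```go
package main

import (
	"errors"
	"fmt"
)

var errShardMissing = errors.New("shard missing")

// wrapV formats the error as text only; the chain is lost.
func wrapV() error { return fmt.Errorf("rebuild failed: %v", errShardMissing) }

// wrapW wraps the error, so errors.Is and errors.As can still reach it.
func wrapW() error { return fmt.Errorf("rebuild failed: %w", errShardMissing) }

// isShardMissing inspects the chain for the sentinel error.
func isShardMissing(err error) bool { return errors.Is(err, errShardMissing) }

func main() {
	fmt.Println(isShardMissing(wrapV())) // false
	fmt.Println(isShardMissing(wrapW())) // true
	fmt.Println(wrapV().Error() == wrapW().Error()) // true: same message text
}
```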

Co-authored-by: 吴奇臻 <wuqizhen@cmict.chinamobile.com>
2026-06-22 20:30:37 -07:00
dependabot[bot] 36f2ddcaea build(deps): bump github.com/apache/iceberg-go from 0.5.0 to 0.6.0 (#10038)
* build(deps): bump github.com/apache/iceberg-go from 0.5.0 to 0.6.0

Bumps [github.com/apache/iceberg-go](https://github.com/apache/iceberg-go) from 0.5.0 to 0.6.0.
- [Release notes](https://github.com/apache/iceberg-go/releases)
- [Commits](https://github.com/apache/iceberg-go/compare/v0.5.0...v0.6.0)

---
updated-dependencies:
- dependency-name: github.com/apache/iceberg-go
  dependency-version: 0.6.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* iceberg: adapt worker to iceberg-go 0.6.0 API

Fields() now yields iter.Seq2 (index, value); SortField.SourceID and
PartitionField.SourceID are methods backed by SourceIDs; RemoveSnapshots
takes a postCommit flag (false here, file cleanup runs through the filer).

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-22 11:51:37 -07:00
qzhello 9de9dbaa83 fix(shell): exclude failed EC shard copies from rebuild recoverability gate (#10043)
* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.

Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.
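
A sketch of the exact-match set next to the substring check it replaces. The name splitCSVSet appears in this commit series, but the signature and body here are assumptions, and matchesExact is a hypothetical caller.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVSet parses a comma-separated flag value into a set of exact ids.
// An empty flag yields an empty set, which callers can test with len().
func splitCSVSet(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, part := range strings.Split(s, ",") {
		if part = strings.TrimSpace(part); part != "" {
			set[part] = struct{}{}
		}
	}
	return set
}

// matchesExact reports whether id was listed in the flag value.
func matchesExact(flagValue, id string) bool {
	_, ok := splitCSVSet(flagValue)[id]
	return ok
}

func main() {
	fmt.Println(strings.Contains("rack10", "rack1")) // true: the old false positive
	fmt.Println(matchesExact("rack10", "rack1"))     // false
	fmt.Println(matchesExact("rack1, rack10", "rack1")) // true
}
```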

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers

* fix(shell): exclude failed EC shard copies from rebuild recoverability gate

prepareDataToRecover incremented the remote-shard counter before the copy
RPC, so in apply mode a failed VolumeEcShardsCopy was still counted toward
the DataShardsCount recoverability gate. The gate could then pass with
fewer real shards than required, deferring the failure to the deeper
generateMissingShards/reconstruct step and reporting an inflated shard
count in the "not enough shards" error.

Count the remote shard only after a successful copy (apply mode) or when
planning (dry-run), and rename wouldCopy to recoverableRemoteShards for
clarity. Add a regression test covering an apply-mode copy failure.
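
The counting rule can be sketched as follows. countRecoverable, failOn, and the shard ids are hypothetical and simplified; the real prepareDataToRecover works against volume servers.

```go
package main

import (
	"errors"
	"fmt"
)

var errCopy = errors.New("copy failed")

// failOn returns a copy function that fails for one specific shard id.
func failOn(bad string) func(string) error {
	return func(shard string) error {
		if shard == bad {
			return errCopy
		}
		return nil
	}
}

// countRecoverable counts local shards plus remote shards that are usable:
// in apply mode only after a successful copy, in dry-run mode when planned.
func countRecoverable(local int, remote []string, apply bool, copyShard func(string) error) int {
	n := local
	for _, shard := range remote {
		if apply {
			if err := copyShard(shard); err != nil {
				continue // a failed copy must not count toward the gate
			}
		}
		n++
	}
	return n
}

func main() {
	remote := []string{"s8", "s9"}
	fmt.Println(countRecoverable(8, remote, true, failOn("s9")))  // 9
	fmt.Println(countRecoverable(8, remote, false, failOn("s9"))) // 10
}
```

With the count taken after the copy, a gate comparing it against the required data shard count fails immediately rather than in the later reconstruct step.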

* fix(shell): clean up copied EC shards when the recoverability gate fails

A runtime copy failure can trip the gate after earlier copies already
succeeded, stranding those working shards on the rebuilder. Return the
copied shard ids on the error path and run the cleanup defer even when
recovery fails, so the temp shards get deleted.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-22 11:23:23 -07:00
MorezMartin 6f1d4af035 fix(filer): propagate proxyChunkId query params to volume server (#10036)
* fix(filer): propagate proxyChunkId query params to volume server

When weed mount reads via filer proxy mode (-volumeServerAccess=filerProxy),
the mount adds query params like readDeleted=true to chunk read requests.

Two bugs prevented these from working:

1. filer_server_handlers.go extracted fileId from the raw RequestURI, which
   includes query params, corrupting the fileId (e.g. '6,abc&readDeleted=true').
   Fix: use r.URL.Query().Get("proxyChunkId") for clean extraction.

2. filer_server_handlers_proxy.go didn't forward query params to the volume
   server. The urlStrings from LookupFileId already contain the fileId in the
   path, so just append the original query string.

* filer: match chunk proxy by query param, not URI prefix order

Order-dependent prefix slicing missed proxyChunkId when it wasn't the
first query param. Gate on root path and read the parsed query value.

* filer: drop internal proxyChunkId from proxied volume query

Lookup URLs already carry the fileId in the path, so forwarding the raw
query duplicated proxyChunkId onto the volume server. Strip it and only
append the remaining caller params (e.g. readDeleted).
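
A sketch of the parsed-query approach using net/url. splitProxyQuery is a hypothetical helper; the real handlers read the value from the request and build the volume server URL from the lookup result.

```go
package main

import (
	"fmt"
	"net/url"
)

// splitProxyQuery extracts proxyChunkId from a raw query string, regardless of
// parameter order, and returns the remaining params to forward downstream.
func splitProxyQuery(rawQuery string) (fileID, forward string, err error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", "", err
	}
	fileID = q.Get("proxyChunkId")
	q.Del("proxyChunkId") // internal param, not meant for the volume server
	return fileID, q.Encode(), nil
}

func main() {
	fileID, forward, err := splitProxyQuery("readDeleted=true&proxyChunkId=6,abc")
	fmt.Println(fileID, forward, err) // 6,abc readDeleted=true <nil>
}
```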

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-22 11:21:29 -07:00
Chris Lu 16ba8af0b7 util/http: lazily init the global HTTP client to fix admin metrics nil panic (#10044)
util/http: lazily init the global HTTP client

GetGlobalHttpClient returned a nil client until InitGlobalHttpClient ran,
which only happens in weed.go's main. Anything that starts a command
in-process bypasses that: the admin server's metrics goroutine seeds a
dashboard sample on startup, reaching fetchPublicUrlMap -> GetGlobalHttpClient().Do,
and nil-derefs the receiver in GetHttpScheme.

Init the client on first Get via sync.Once so it is never nil regardless of
the startup path. InitGlobalHttpClient keeps its eager-init role through the
same Once.
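
The lazy-init pattern, sketched with hypothetical names (client, initGlobalClient, getGlobalClient) rather than the actual util/http types:

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for the global HTTP client wrapper.
type client struct{ scheme string }

func (c *client) Scheme() string { return c.scheme }

var (
	globalClient *client
	initOnce     sync.Once
)

// initGlobalClient is the eager entry point; it runs the setup at most once.
func initGlobalClient() {
	initOnce.Do(func() {
		globalClient = &client{scheme: "http"}
	})
}

// getGlobalClient initializes on first use through the same Once, so callers
// never observe a nil client regardless of the startup path.
func getGlobalClient() *client {
	initGlobalClient()
	return globalClient
}

func main() {
	// No explicit init call, as when a command is started in-process.
	fmt.Println(getGlobalClient().Scheme()) // http
}
```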
2026-06-22 10:20:02 -07:00