* s3: support AWS object form for bucket policy Principal, add NotPrincipal
Bucket policy statements only accepted a bare string or array of strings for
the Principal element, so the AWS-documented object form was rejected:
"Principal": { "AWS": "arn:aws:iam::123456789012:root" }
"Principal": { "AWS": ["arn:...", "999999999999"] }
Add a PolicyPrincipal type that parses the bare string, the bare array
(retained for backward compatibility), and the object form keyed by AWS,
Service, Federated or CanonicalUser (each value a string or array). All keyed
values are flattened for principal matching, and the original JSON is preserved
so a PutBucketPolicy/GetBucketPolicy round trip returns the exact shape
submitted, which keeps infrastructure-as-code tools (Terraform, Ansible) idempotent.
Also add NotPrincipal support (a statement applies to every principal except the
ones named), compiled and evaluated in both policy evaluators, and reject
statements that specify both Principal and NotPrincipal.
* s3: address review - validate principal object form, honor dynamic NotPrincipal
- Reject unsupported Principal object keys (only AWS/Service/Federated/
CanonicalUser) and empty values, so a form like {"AWS":[]} no longer compiles
to zero matchers and silently relies on the match-all fallback.
- Detect both Principal and NotPrincipal by field presence, not by flattened
length, so a present-but-empty field is still rejected.
- Honor dynamic (policy-variable) NotPrincipal/Principal patterns in the
compiled evaluator; previously a NotPrincipal made only of variables was
treated as absent and its exclusion bypassed.
- Add regression tests for the object-form validation and dynamic NotPrincipal.
* address review: remove unnecessary success and failure counts
* fix: use Gather.Gather() with seeded counter for EC rebuild registration test
- Restore Gather.Gather() to verify MustRegister calls as requested in review
- Seed VolumeServerECRebuildCounter before gathering because CounterVec
only appears after at least one label value is observed
- Use correct fully-qualified metric names (SeaweedFS_volumeServer_*)
* fix: remove preflight checkEcVolumeStatus failure from ec_rebuild_total counter
ec_rebuild_total should only reflect actual rebuild execution failures
(from RebuildEcFiles / RebuildEcxFile), not scan/precheck failures in
the volume status loop. The error is still returned to the caller;
only the misleading counter increment was removed.
* address review: remove unnecessary observe call
* label EC rebuild duration histogram by result
Without a result label, fast failures pull down the success-latency
quantiles shown on the EC Rebuild Duration panel. Make the histogram a
HistogramVec keyed by result, record success/failure through one
recordEcRebuild helper, and split the Grafana quantiles by (le, result).
* reset EC rebuild metric vecs in registration test
The HistogramVec needs a child before Gather emits it, so the test must
observe once; reset both vecs in cleanup so that sample doesn't leak into
other tests.
---------
Co-authored-by: Ubuntu User <ubuntu@example.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
An empty or truncated tasks/*.pb file unmarshals into a TaskStateFile
with a nil Task, and protobufToMaintenanceTask dereferenced it
immediately, panicking the whole admin process on startup. Guard the
nil case so the loader logs a warning and skips the bad file.
Under a herd of concurrent assigns with no writable volume, Assign spun
PickForWrite for the full 10s timeout, pinning a goroutine per request and
starving the master of the cycles it needs to process growth and answer
heartbeats. When growth is the relevant remedy and already in flight, stop
spinning: if free space exists, shed with a fast retryable error so clients
back off and retry once growth lands; if the cluster is out of space, fail fast
with the real out-of-space error instead of masking it as retryable.
The gRPC shed uses ResourceExhausted, not Unavailable: operation.Assign retries
it, but the client connection layer doesn't treat it as a dead channel, so a
per-request shed across a herd doesn't tear down the shared master connection
and cancel every other in-flight assign. The HTTP dirAssignHandler sheds with
503 + Retry-After.
* volume server: route VolumeMarkReadonly to raft leader
After a master raft election, volume servers may still heartbeat a follower
while admin paths such as weed shell volume.mark call notifyMasterVolumeReadonly
via vs.GetMaster(). Followers reject VolumeMarkReadonly with NotLeader, which
breaks tiering and other mark-readonly workflows until the heartbeat loop
reconnects.
Resolve the leader through GetMasterConfiguration on configured -master peers
(same Leader field filer/master clients already use) before calling
VolumeMarkReadonly. When the leader differs from the heartbeat peer, update
currentMaster so the heartbeat loop converges faster.
Adds operation.LookupRaftLeaderMaster with unit tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: address review feedback on volume.mark raft leader routing
Do not update currentMaster during leader lookup — heartbeat owns that
field and uses stream GetLeader() to reconnect. Try the heartbeat peer
first and only resolve the raft leader after a NotLeader rejection.
Add ctx.Err() early exit and quieter logging for context cancellation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(operation): thread the lookup timeout ctx into connection invalidation
The 5s timeout drove only the RPC; WithMasterServerClient saw the
unbounded outer ctx, so a self-inflicted timeout (slow GetMasterConfiguration
during an election) was treated as a stale channel and tore down the shared
master connection. Pass the timeout ctx into the helper so its own expiry
leaves ctx.Err() set and spares the connection.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* fix(filer.sync.verify): sort listings client-side before merge
The merge walks both filers' directory listings in lockstep and needs
them in the same byte order. A filer before 4.32 with a locale SQL
collation lists case-insensitively while a 4.32+ peer lists byte-ordered,
so comparing two such clusters returns the same names in a different
order and the merge desyncs into spurious MISSING / ONLY_IN_B.
Buffer and sort each directory client-side so both sides agree on order
regardless of filer version or store backend. Trades the streaming
source's O(buffer) memory for O(directory) per side, fine for a one-shot
verify CLI; both sides still load concurrently.
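A minimal sketch of the idea, assuming each directory is reduced to a list of names: sort both sides in byte order, then walk them in lockstep. The function diffListings is a hypothetical name, not the verify code itself.

```go
package main

import (
	"fmt"
	"sort"
)

// diffListings sorts copies of both listings in byte order, then merges them
// in lockstep and reports names present on only one side.
func diffListings(a, b []string) (onlyA, onlyB []string) {
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a) // Go string comparison is byte-wise
	sort.Strings(b)
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			i++
			j++
		case a[i] < b[j]:
			onlyA = append(onlyA, a[i])
			i++
		default:
			onlyB = append(onlyB, b[j])
			j++
		}
	}
	onlyA = append(onlyA, a[i:]...)
	onlyB = append(onlyB, b[j:]...)
	return onlyA, onlyB
}

func main() {
	caseInsensitive := []string{"apple", "Banana", "cherry"} // locale-collated order
	byteOrdered := []string{"Banana", "apple", "cherry"}     // byte order
	onlyA, onlyB := diffListings(caseInsensitive, byteOrdered)
	fmt.Println(len(onlyA), len(onlyB))
}
```

Without the sort, the lockstep walk would compare "apple" with "Banana" first and report both as missing on the other side.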
Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF
* fix(filer.sync.verify): surface listing errors before merging
A listing that fails mid-stream leaves a partial, unsorted buffer. Now
that both sides are fully buffered anyway, check each side's error right
after the loads finish and before the merge, so partial entries can't
emit spurious MISSING / ONLY_IN_B before the error aborts the run.
Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF
* fix(shell): correct volume.list -writable filter unit and comparison
* chore(shell): fix typo in EC shard helper param names
* fix(shell): use exact match for volume.balance -racks/-nodes filter
The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.
Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.
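The difference between the substring check and an exact-match set, sketched with hypothetical helpers (splitCSVSketch and matchesFilter are illustrative names, and treating an empty set as "no filter" is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVSketch turns "rack10,rack2" into a set of exact ids.
func splitCSVSketch(value string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, part := range strings.Split(value, ",") {
		if part = strings.TrimSpace(part); part != "" {
			set[part] = struct{}{}
		}
	}
	return set
}

// matchesFilter reports whether id passes the filter; an empty set means
// the flag was not given, so everything passes (assumed behavior).
func matchesFilter(set map[string]struct{}, id string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[id]
	return ok
}

func main() {
	flagValue := "rack10,rack2"
	set := splitCSVSketch(flagValue)
	fmt.Println(strings.Contains(flagValue, "rack1")) // substring check: matches
	fmt.Println(matchesFilter(set, "rack1"))          // exact check: does not match
	fmt.Println(matchesFilter(set, "rack10"))
}
```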
* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
* fix: apply collectionPattern during detection in volume.fix.replication
* use existing wildcard.MatchesWildcard for collection matching
It returns a plain bool, so drop the up-front filepath.Match validation
and the path/filepath import that only existed to handle its error.
* trim verbose comments to terse one-liners
* drop redundant per-path collection guards
Detection already filters by replicas[0].info.Collection. The repair guard
re-checked pickOneReplicaToCopyFrom's collection (a different replica), so a
mixed-collection volume could pass detection yet be skipped in repair without
decrementing the counter, spinning the -apply loop. deleteOneVolume keeps its
collectionIsMismatch safety.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
mount: move directory cache state to a side map to shrink InodeEntry
The mount keeps an InodeEntry alive for every inode the kernel references.
On a mount that is almost entirely regular files, each entry carried the full
directory readdir-cache bookkeeping (four time.Time fields plus counters),
bloating it to 152 bytes whether or not the inode was a directory.
Move that state into a dirState held in a side map keyed by inode, and drop the
isDirectory bool: an inode is a directory iff it has a dirState. InodeEntry is
now just paths + nlookup at 32 bytes, landing in a smaller Go allocator size
class; on a mount with tens of millions of cached file inodes that is several GB
less resident heap. As a side effect the readdir-cache scan helpers iterate only
directories instead of every inode.
* fix(volume): fsync .vif and downloaded tier .dat (Rust)
save_volume_info wrote the .vif with a plain write and no fsync, and the
tier download never synced the .dat it wrote. Either could be lost on a
crash before the tier-down path acts on them. fsync both, matching the Go
volume server's util.WriteFile and DownloadFile.
* fix(volume): swap to local before deleting remote on tier-down (Rust)
The tier-down path deleted the shared remote object before trimming the
.vif, so a crash in between left the volume's .vif pointing at a deleted
object. It also dropped the remote backend only on the delete path and
never opened the downloaded local .dat, so reads broke until reload and a
keep-remote download kept serving from the slow remote object.
Trim the .vif and swap to the local .dat on both paths, bracketed by
directory fsyncs, before removing the remote object; gate only the object
removal on keep_remote_dat_file. Matches the Go volume server's crash-safe
ordering.
After VolumeTierMoveDatToRemote uploaded the .dat, the volume closed its
local backend but never opened the remote one, leaving both dat_file and
remote_dat_file empty. The needle read path has no lazy reopen, so reads
returned "dat file not open" until the volume reloaded.
Switch to the remote backend right after saving the .vif, the same as the
Go volume server's LoadRemoteFile, so the volume keeps serving from remote
storage immediately after tiering.
* ci: add per-process memory sampler for perf jobs
Samples VmRSS once a second into a CSV and records peak VmHWM per process
on stop. Linux only; reads /proc/<pid>/status.
* ci: run perf benchmarks on the Rust volume server and report memory
Matrix the throughput and S3 jobs over go/rust volume servers, using a
standalone master (plus filer for S3) and swapping only the volume binary
so the two are directly comparable. Sample peak RSS in every job and surface
it per impl in the run summary.
* ci: harden mem sampler arg handling and peak fallback
Guard against missing args under set -u, and fall back to the max RSS
sampled when a process exits before VmHWM can be read.
* ec: recover EC shards whose .ecx index lives only on a peer server
A volume server that boots with EC shard files on disk but no .ecx index
on any local disk cannot mount the shards, so the master never learns
about them. ec.rebuild works off master-registered shards, so it sees the
volume as short and gives up even though the shard data is intact.
Add an operator-triggered recovery: VolumeEcShardsMount gains a
recover_missing_index flag that makes the volume server fetch the missing
.ecx (plus .ecj/.vif) from a peer holding it and mount the on-disk shards.
ec.rebuild runs this across the cluster before planning, so orphaned
shards register and the rebuild sees the true shard set.
.ecx is an immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal that differs across holders, so the
recovered node adopts the source peer's deletion view, like a balanced or
rebuilt shard does.
* ec: mirror missing-index recovery into the Rust volume server
Port the #10104 recovery to seaweed-volume so the Rust volume server
self-heals the same layout: EC shards on disk with the .ecx index only on
a peer. Adds collect_ec_volumes_missing_index / mount_recovered_ec_shards
to the store, recover_missing_ec_indexes (master LookupEcVolume + peer
CopyFile fetch + mount) to the server, and the recover_missing_index flag
on VolumeEcShardsMount.
.ecx is the immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal, so the recovered node adopts the source
peer's deletion view, matching the Go path.
* fix(volume): stream copy_file from disk instead of buffering whole file
copy_file pushed every 2MB chunk into a Vec and only then returned tokio_stream::iter(results), so serving a near-limit volume as a copy source (e.g. during volume.fix.replication) held the entire .dat resident and could OOM the process. Stream chunks through a bounded mpsc channel from a spawn_blocking reader instead; caps memory at ~16MB per transfer with backpressure.
* fix(volume): stream volume_incremental_copy from disk instead of buffering
Same buffering pattern as copy_file: every 2MB chunk was pushed into a Vec and only then returned via tokio_stream::iter, holding the entire delta resident. Stream the byte range from an owned file handle through a bounded mpsc channel, mirroring the copy_file fix.
* test(volume): cover streaming copy_file and volume_incremental_copy
Adds a multi-chunk .dat fixture and tests asserting both handlers stream in 2MB chunks (multiple messages), reassemble byte-for-byte, carry modified_ts_ns only on the first copy_file message, and honor stop_offset.
* address review: use u64 byte counters; stream local incremental copy without holding the store lock
- copy_file/volume_incremental_copy: track remaining bytes and offsets as u64 instead of casting uint64 stop_offset/dat_size through i64 (CodeRabbit).
- volume_incremental_copy: for local volumes open the .dat and stream directly with no lock held; only remote/tiered volumes take the per-chunk read_dat_slice path, so a remote S3 read is never performed while holding the store read lock (Gemini).
* volume (Rust): stream tiered incremental copy off the store lock, open .dat under it
Capture the reader for volume_incremental_copy while the volume lookup is still
under the store read lock: an open File for local volumes, a cloned remote
backend handle for tiered ones. Then drop the lock and stream with none held.
Opening under the lock pins the reader to the volume that exists now, so a
concurrent delete/recreate can't stream from the wrong file, and a slow S3
fetch for a tiered .dat no longer blocks store writers (the remote path
previously re-took the store lock per chunk).
Use a non-uniform copy-test payload so chunk reassembly catches duplicated or
reordered chunks a repeated byte would hide.
* volume (Rust): return empty when incremental-copy start offset is past the .dat
A corrupt needle index could locate an offset beyond the captured .dat size,
underflowing the dat_size - start_offset subtraction (panic in debug, wrap in
release). Guard it up front like the other empty-delta early returns.
---------
Co-authored-by: adri <adri@digitalunited.net>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Drop max-parallel so the 13 per-platform builds run together instead of two
waves of 8 (rocksdb was queuing behind the cap and starting ~8 min late).
Keep cache-to mode=max for rocksdb: its RocksDB static_lib compile is
sha-independent, so it caches across releases and stops being the ~16-min
long-pole that gates the merge fan-in. go-build variants stay mode=min.
docker release: build per-platform on native runners, drop mode=max cache
The build job built every platform of a variant on one runner, so 2-4 Go
cross-compiles fought over a single 2-vCPU box and arm64 ran in an emulated
context. Split the matrix to one platform per job on a native runner
(amd64/386 on ubuntu-latest, arm64/arm-v7 on ubuntu-24.04-arm); only arm/v7
still needs QEMU, and only for its final apk stage. Each job pushes by
digest, and a new merge job assembles the multi-arch tag with imagetools
and mirrors it to Docker Hub.
cache-to mode=max -> mode=min: BRANCH=sha cache-busts the heavy go-build
layer every release, so writing all intermediate layers to the gha backend
spent 3-11 min per variant on a cache the next release's sha can never hit.
* test: add self-contained S3 read/write load tool
Concurrent PUT/GET against the S3 gateway, reporting requests/sec,
transfer rate, and latency percentiles. Built on the aws-sdk-go-v2
client the S3 tests already use, so no extra benchmark binary is needed.
* ci: add performance workflow
Three parallel jobs: cpu/heap pprof of the server under write load,
native throughput via weed benchmark plus the Go micro-benchmarks, and
an S3 read/write benchmark against the gateway. Runs on push to master
and manual dispatch with tunable duration, object count, size, and
concurrency.
* sts: enforce session-policy explicit deny during role chaining
A chained AssumeRole caller authenticates with an STS session token whose
inline session policy can explicitly deny sts:AssumeRole. The deny check only
evaluated the caller's named policies, so such a session could still chain into
any role its trust policy admits. Validate the session token in the deny check
and honor an explicit Deny in the inline session policy too.
* test(sts): integration coverage for AssumeRole authorization
Add an end-to-end AssumeRole authorization test (real weed mini + boto3):
a non-admin caller assumes a role its trust policy admits, an explicit
identity-side deny is blocked, and a session policy's explicit deny blocks
role chaining.
* sts: skip OIDC tokens and reject revoked sessions in the chaining deny check
Review follow-ups on the session-policy deny check:
- Guard session validation with !isOIDCToken so a bearer token our STS service
cannot validate does not error into a false deny.
- Reject a revoked session before evaluating its policy, restoring the
revocation enforcement the AssumeRole path lost when it stopped routing
through IsActionAllowed.
* fix(sts): authorize AssumeRole by the role's trust policy
The role's trust policy already declares who may assume it, but the caller
also had to pass an identity-side sts:AssumeRole check that only the Admin
action could satisfy — legacy static identities have no way to express
sts:AssumeRole on a role. So assuming any role required a full admin
identity. Drop the redundant check and let the trust policy be the authority;
scope it to specific principals to restrict who can assume.
* sts: resolve caller principal ARN for the trust-policy check
A legacy static identity can reach AssumeRole without a PrincipalArn set;
passing the empty value would miss a trust policy that names a concrete
principal. Resolve it to the canonical user ARN, sharing the logic
GetCallerIdentity already used inline.
* sts: enforce explicit identity-side deny for AssumeRole
Authorizing a named role by its trust policy alone dropped identity-side
evaluation entirely, so a caller whose attached policy explicitly denies
sts:AssumeRole could still assume any role the trust policy admits. Re-check
the caller's policies through the IAM manager for an explicit deny
(deny-always-wins) without requiring an allow; the trust policy stays the
allow authority.
* fix(postgres): prevent uint32 underflow & OOM in message parsing
* postgres: drop redundant startup guard, use maxStartupMessageSize const
The msgTotalLen < 8 check already guarantees msgLength >= 4, so the extra
msgLength < 4 guard before reading the protocol version was unreachable.
Point the startup size limit at maxStartupMessageSize instead of a literal.
* postgres: trim query terminator safely, cap pre-auth payloads
Use strings.TrimSuffix for the simple-query null terminator so a
non-null-terminated body isn't silently shortened, matching the auth
handlers. Bound password/MD5 reads with a dedicated maxAuthMessageSize
(10 KiB) instead of the 100 MiB maxMessageSize, since these payloads are
read before authentication.
---------
Co-authored-by: shangshuhan <shangshuhan@cmict.chinamobile.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
A FUSE interrupt is not a process kill. Go's async preemption (SIGURG)
makes a close() under load emit an interrupt on nearly every flush, so
deriving the metadata-flush context from the FUSE cancel channel turned
healthy concurrent close()s into EIO: the interrupt cancelled the
in-flight CreateEntry, which surfaced as "input/output error".
Bound the flush with a deadline instead. A healthy CreateEntry finishes
in well under a second, so the deadline only fires against a genuinely
stuck filer -- still keeping close() from hanging forever -- while
benign preemption no longer aborts a good flush.
Introduce security.BearerPrefix ("Bearer ", RFC 6750) and use it
everywhere an "Authorization: Bearer <token>" header is constructed,
replacing the scattered "BEARER "/"Bearer " string literals. SeaweedFS
matches the scheme case-insensitively when parsing (security.GetJwt), so
behavior is unchanged; this removes the magic string and settles the
casing on the standard form. The parser's upper-case comparison stays as
is on purpose.
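A small sketch of the constant and a case-insensitive parse. The names bearerPrefix, bearerHeader, and tokenFromHeader are hypothetical, and the parse here uses strings.EqualFold for brevity, whereas the real parser is described above as an upper-case comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerPrefix is the RFC 6750 scheme followed by a single space.
const bearerPrefix = "Bearer "

// bearerHeader builds the Authorization header value.
func bearerHeader(token string) string {
	return bearerPrefix + token
}

// tokenFromHeader extracts the token, matching the scheme case-insensitively.
func tokenFromHeader(header string) string {
	if len(header) > len(bearerPrefix) &&
		strings.EqualFold(header[:len(bearerPrefix)], bearerPrefix) {
		return header[len(bearerPrefix):]
	}
	return ""
}

func main() {
	fmt.Println(bearerHeader("abc"))
	fmt.Println(tokenFromHeader("BEARER abc"), tokenFromHeader("Bearer abc"))
}
```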
The /?proxyChunkId= endpoint forwards the caller's headers to the volume
server but never mints a read token, so proxied chunk reads return 401
once jwt.signing.read.key is configured. Generate a fileId-scoped volume
token the same way the direct filer read path does, which fixes
filer.sync, filer.backup, filerProxy mounts, the MQ broker and the upload
gateway in one place.
* build(deps): bump com.fasterxml.jackson.core:jackson-databind
Bumps [com.fasterxml.jackson.core:jackson-databind](https://github.com/FasterXML/jackson) from 2.18.6 to 2.22.0.
- [Commits](https://github.com/FasterXML/jackson/commits)
---
updated-dependencies:
- dependency-name: com.fasterxml.jackson.core:jackson-databind
dependency-version: 2.22.0
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
* build(deps): pin jackson-annotations to its own 2.22 version
jackson-annotations dropped the patch digit in 2.20 and releases on its
own line, so 2.22.0 does not exist. Sharing jackson.version broke
dependency resolution; give it a dedicated property.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* s3: replicate by fanning out from the gateway to every holder
The S3 gateway uploaded each chunk to one volume server, which then
relayed the copies to the other replica holders. The gateway now uploads
each chunk to every holder in parallel (type=replicate), removing the
primary volume server's receive-then-resend relay.
AssignVolume returns every replica holder (new repeated Location replicas,
forwarded from the master assign), the s3api captures them, and the
chunked uploader fans out whenever a chunk has more than one holder.
Cipher uploads keep the server-driven path since per-call encryption would
diverge the replicas.
* s3: cancel sibling replica uploads on the first failure
* s3: trim replica fan-out comments
* s3: roll back successful fan-out chunk copies when a holder fails
A failed fan-out records no FileChunk, so copies that landed on the holders
that finished before the cancel were leaked as orphans the caller could not
see. Track the holders that succeeded and delete the needle from each
(type=replicate, local-only) on failure, leaving nothing behind.
fix(wdclient): prevent stale cache fallback for empty volume locations
## Problem
During Kubernetes pod restarts, volume servers temporarily disconnect and their
locations are removed from vidMap. The deleteLocation function leaves an empty
array [] in vid2Locations map instead of removing the key entirely.
GetLocations() was checking 'if found && len(locations) > 0', which would fail
for empty arrays and fall back to the cache chain, returning STALE locations
from before the restart. This caused S3 gateway to try connecting to old pod
IPs that no longer exist, resulting in connection timeouts and hanging registry
sync jobs.
Example timeline:
1. Volume pod at 10.131.1.28:8081 registers volumes 10,12
2. S3 gateway caches: vid2Locations[10] = [10.131.1.28:8081]
3. Pod restarts, gets new IP 10.131.1.65:8081
4. Master sends delete → vid2Locations[10] = [] (empty, but key exists)
5. BUG: GetLocations(10) sees found=true, len=0 → falls back to cache
6. Returns stale 10.131.1.28:8081 instead of waiting for new location
7. S3 requests time out trying to reach the unreachable old IP
## Solution
Distinguish between two cases:
- found=true, locations=[] : Volume explicitly has no locations (e.g. restart)
→ Return nil, false (no fallback to cache)
- found=false : Volume never seen in current map
→ Check cache (preserve cache benefits for unknown volumes)
An empty array explicitly means 'this volume currently has no locations',
which is semantically different from 'volume unknown'. Don't fall back to
stale cache for explicitly empty volumes.
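The two cases can be sketched with plain maps. This is illustrative only: vidMapSketch and its fields are hypothetical and stand in for the real vidMap and its cache chain.

```go
package main

import "fmt"

// vidMapSketch holds the current view plus one level of older cache.
type vidMapSketch struct {
	current map[uint32][]string
	cache   map[uint32][]string
}

// getLocations falls back to the cache only when the volume id is absent
// from the current map; a present-but-empty entry means "no locations now".
func (m *vidMapSketch) getLocations(vid uint32) ([]string, bool) {
	locations, found := m.current[vid]
	if found {
		if len(locations) == 0 {
			return nil, false // explicitly empty: do not serve stale cache
		}
		return locations, true
	}
	if cached, ok := m.cache[vid]; ok && len(cached) > 0 {
		return cached, true // unknown volume: cache is still useful
	}
	return nil, false
}

func main() {
	m := &vidMapSketch{
		current: map[uint32][]string{10: {}},
		cache: map[uint32][]string{
			10: {"10.131.1.28:8081"},
			12: {"10.131.1.28:8081"},
		},
	}
	fmt.Println(m.getLocations(10)) // emptied by a delete: no fallback
	fmt.Println(m.getLocations(12)) // never seen in current: cache is used
}
```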
## Testing
Added comprehensive tests:
- TestGetLocationsEmptyArrayNoFallback: Verifies empty arrays don't use cache
- TestGetLocationsUnknownVolumeUsesCache: Verifies unknown volumes still use cache
- All existing tests pass
## Impact
Fixes registry sync job hangs during SeaweedFS upgrades/restarts. S3 gateway
will now correctly wait for updated volume locations instead of using stale
cached IPs.
Related: OutSystems.SeaWeedfs Helm chart, vega cluster incident 2026-06-24
A config-file reload (SIGHUP) routed through MergeS3ApiConfiguration,
which skips identities marked static so dynamic admin/filer updates can't
clobber them. That also blocked the config file itself from updating its
own identities, so editing a secretKey and reloading had no effect.
Thread a fromStaticFile flag from the file-load path into the merge: the
authoritative file overwrites its static identities (and reapplies service
accounts under them), while dynamic updates still leave them immutable.
Mark the rebuilt identities static in the merge so a concurrent
RemoveIdentity never observes them as removable mid-reload.
Standalone weed s3 created a master client and registered the receiving
SeaweedS3IamCache gRPC service, but never wrapped its credential store
with the propagating store. Only the filer-embedded path called
SetMasterClient, so IAM mutations on one s3 pod never reached peers; they
served a stale in-memory identity cache and returned InvalidAccessKeyId
until restarted.
Wrap the credential store with the master client when one is available,
mirroring the filer path, so mutations fan out over the existing gRPC
cache service.
* s3tables: allow hyphens in namespace and table names
Iceberg REST clients routinely use hyphenated namespace/table names, but the
S3 Tables charset (a-z, 0-9, _) rejected them with 400. Accept '-' as an
interior character (names must still start, and namespaces end, with a letter
or digit), making the catalog conformant for those clients. A permissive
superset of the AWS S3 Tables charset.
* s3tables: allow hyphens in table ARN parsing too
The ARN regexes still excluded '-', so parseTableFromARN rejected ARNs with
hyphenated namespace/table names and existing reject-the-hyphen tests broke.
Widen the ARN patterns to match the validator, retarget those tests at a
still-invalid leading-hyphen name, and cover ARN parsing with hyphens.
* s3tables: purge decoupled table data without deleting the reused name path
A renamed or created-over-leftover table keeps its data at a location that
differs from its catalog name path. Drop now purges that data location and
clears the marker, instead of recursively deleting the name path, which may
still hold another table's data.
* iceberg: route a table created over a leftover to a unique location
When the default location is occupied by a leftover directory (data kept when
another table was renamed to this name), create the new table at a unique
location so it cannot overwrite that table's metadata. Common case is unchanged.
* iceberg: fail table create when the leftover-path check errors
A transient filer lookup error fell through as "not occupied", routing the
new table back to the default path and risking the very overwrite this check
guards against. Propagate the error and return 500 instead.
* s3tables: assert all catalog xattrs cleared on decoupled drop
Seed the full marker set so the test catches a regression that leaves the
policy, tags, version, or entry-type attribute on the reused name path.
* s3tables: refuse to drop a table whose data path is an ancestor
Corrupt metadata can resolve the data path to the bucket or namespace root,
which the bucket-scope check still admits; a recursive purge there would wipe
sibling tables. Reject an ancestor data path before deleting.
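An ancestor check of this kind can be sketched on cleaned paths. isAncestorOrSame is a hypothetical helper and the example paths are made up; comparing with a trailing slash avoids treating /buckets/b/n as an ancestor of /buckets/b/ns.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isAncestorOrSame reports whether candidate equals target or contains it.
func isAncestorOrSame(candidate, target string) bool {
	c := path.Clean("/" + candidate)
	t := path.Clean("/" + target)
	if c == t || c == "/" {
		return true
	}
	return strings.HasPrefix(t, c+"/")
}

func main() {
	tablePath := "/buckets/b/ns/t" // illustrative catalog name path
	for _, dataPath := range []string{"/buckets/b", "/buckets/b/ns", "/buckets/b/ns/t-1234"} {
		// A purge would be refused whenever this prints true.
		fmt.Println(dataPath, isAncestorOrSame(dataPath, tablePath))
	}
}
```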
A directory rebuild wiped the cached children, listed the filer once, and
published the directory authoritatively cached over whatever came back. A
transient empty listing -- a momentary list-stream glitch that ends as a
clean EOF with no entries -- then stranded a populated directory cached
over an empty store, hiding every file in it until some unrelated event
happened to rebuild it: stat returns ENOENT and readdir returns nothing
though the files are safe on the filer, and nothing re-triggers a build.
Re-read the directory when the listing comes back empty before trusting
it. The first re-read is immediate, since the likely transient clears on a
fresh stream; later attempts space out. A genuinely empty directory still
lists empty every time and caches as before, so only empty listings pay
the extra read.
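The re-read policy, sketched with a hypothetical listWithRecheck helper. The attempt count and backoff are illustrative parameters, not the values the mount uses.

```go
package main

import (
	"fmt"
	"time"
)

// listWithRecheck trusts a non-empty listing at once, but re-reads an empty
// one up to rechecks times. The first re-read is immediate; later ones wait
// a growing multiple of backoff.
func listWithRecheck(list func() ([]string, error), rechecks int, backoff time.Duration) ([]string, error) {
	var entries []string
	for attempt := 0; attempt <= rechecks; attempt++ {
		if attempt > 1 {
			time.Sleep(time.Duration(attempt-1) * backoff)
		}
		var err error
		entries, err = list()
		if err != nil {
			return nil, err
		}
		if len(entries) > 0 {
			return entries, nil
		}
	}
	return entries, nil // still empty after every re-read: genuinely empty
}

func main() {
	calls := 0
	glitchy := func() ([]string, error) {
		calls++
		if calls == 1 {
			return nil, nil // transient empty listing ending in a clean EOF
		}
		return []string{"a.txt", "b.txt"}, nil
	}
	entries, err := listWithRecheck(glitchy, 3, time.Millisecond)
	fmt.Println(entries, err, calls)
}
```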
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
With default_permissions (the mount default) the kernel enforces unix
permission bits from the getattr/lookup attributes before it ever calls
Open, Create, or Mknod. The mount was re-checking permissions in
AcquireHandle and createRegularFile anyway, which duplicated the kernel's
work and kept the supplementary-group lookup on the per-file hot path.
Gate only the mode-bit access check on default_permissions being off, so
a non-root copy does no permission work on open/create. createRegularFile
still loads the parent to validate it exists, since the create RPC skips
the filer-side parent check. With default_permissions off the mount
remains the sole enforcer, so the full check still runs.
* operation: bound AssignVolume with a deadline
AssignVolume ran on context.Background(), so when the filer is overwhelmed
the RPC could block indefinitely and wedge every caller holding the
connection. Give it a 30s deadline so a stuck assign fails and the caller's
retry/error path runs instead of hanging forever.
* mount: abort flush when the FUSE request is interrupted
On close(), a killed process blocks in fuse_flush waiting for the mount to
answer. doFlush ran its metadata CreateEntry on context.Background() and
ignored the kernel interrupt channel, so against an overwhelmed filer the
flush never completed and the process stayed in uninterruptible sleep --
making the pod un-killable.
Derive a context from the FUSE cancel channel in Flush/Fsync and thread it
through doFlush -> flushMetadataToFiler -> streamCreateEntry; the retry loop
stops as soon as the context is cancelled. Release and the pre-rename flush
keep a non-cancellable context since they must finish regardless.
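One way to derive a context from an interrupt channel, as a sketch only. contextFromCancel, flushWithInterrupt and errFilerBusy are hypothetical names, and the write callback takes no context here purely to keep the example short; the real path threads the context down to the RPC.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errFilerBusy = errors.New("filer busy")

// contextFromCancel returns a context that is cancelled when interrupt fires.
// The caller must call the returned stop function to release the goroutine.
func contextFromCancel(interrupt <-chan struct{}) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		select {
		case <-interrupt:
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}

// flushWithInterrupt retries write until it succeeds or the request is interrupted.
func flushWithInterrupt(interrupt <-chan struct{}, retryDelay time.Duration, write func() error) error {
	ctx, stop := contextFromCancel(interrupt)
	defer stop()
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := write(); err == nil {
			return nil
		}
		// Wait before the next attempt, but stop as soon as the context is cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(retryDelay):
		}
	}
}

func main() {
	interrupt := make(chan struct{})
	close(interrupt) // simulate the kernel interrupting the request
	err := flushWithInterrupt(interrupt, time.Millisecond, func() error { return errFilerBusy })
	fmt.Println(errors.Is(err, context.Canceled))
}
```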
* operation: harden the AssignVolume timeout test
Make the test double's signal send non-blocking and bound the receive with a
timeout so a regression can't wedge the test instead of failing it.
* fix(filer.backup): repair chunk-incomplete and stale destination entries
filer.backup left destinations diverged while metadata advanced — chunk-incomplete
(missing/gapped ranges at full attr.file_size) or holding a chunk superseded by a
missed overwrite. The skip/repair decision keyed on filer.FileSize (the attr),
which a truncated entry keeps full, so it never repaired.
Decide from actual chunk state instead:
- coversReference: range-by-range containment (scalar byte totals and attr
FileSize/Md5 cannot see chunk-level gaps).
- hasStaleBackupChunk: a backup-written chunk (SourceFileId) the source no longer
lists; ignores out-of-band (rsync/direct) chunks.
- destinationMatchesReference: allocation-free positional fast path gating the
above so they run only on divergence (the in-sync path stays cheap).
- A strictly-newer destination is never repaired, so an older out-of-order replay
cannot roll it back. The stale signal is deferred at equal mtime (same-second
versions cannot be ordered; reliable S3 sub-second ordering is a separate fix).
Tests in filer_sink_test.go.
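To illustrate the two precise checks, here is a simplified sketch. The chunk struct and both function bodies are assumptions made for illustration; they ignore overwrite ordering and mtime, which the real code must handle.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk is a simplified stand-in for a file chunk. Source chunks carry FileId;
// backup-written destination chunks carry the SourceFileId they were copied from.
type chunk struct {
	FileId       string
	SourceFileId string
	Offset, Size int64
}

type span struct{ start, stop int64 }

// mergedSpans returns the sorted, coalesced byte ranges covered by chunks.
func mergedSpans(chunks []chunk) []span {
	spans := make([]span, 0, len(chunks))
	for _, c := range chunks {
		if c.Size > 0 {
			spans = append(spans, span{c.Offset, c.Offset + c.Size})
		}
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i].start < spans[j].start })
	merged := spans[:0]
	for _, s := range spans {
		if n := len(merged); n > 0 && s.start <= merged[n-1].stop {
			if s.stop > merged[n-1].stop {
				merged[n-1].stop = s.stop
			}
			continue
		}
		merged = append(merged, s)
	}
	return merged
}

// coversReference reports whether every reference range is contained in the
// destination's ranges. Byte totals alone cannot detect a gap; this can.
func coversReference(dest, ref []chunk) bool {
	have := mergedSpans(dest)
	for _, want := range mergedSpans(ref) {
		covered := false
		for _, h := range have {
			if h.start <= want.start && want.stop <= h.stop {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

// hasStaleBackupChunk reports whether dest holds a backup-written chunk that
// the source no longer lists. Chunks without SourceFileId are out-of-band and ignored.
func hasStaleBackupChunk(dest, ref []chunk) bool {
	live := make(map[string]struct{}, len(ref))
	for _, r := range ref {
		live[r.FileId] = struct{}{}
	}
	for _, d := range dest {
		if d.SourceFileId == "" {
			continue
		}
		if _, ok := live[d.SourceFileId]; !ok {
			return true
		}
	}
	return false
}

func main() {
	ref := []chunk{{FileId: "a", Offset: 0, Size: 10}, {FileId: "b", Offset: 10, Size: 10}}
	gapped := []chunk{{SourceFileId: "a", Offset: 0, Size: 10}}
	fmt.Println(coversReference(gapped, ref), hasStaleBackupChunk(gapped, ref))
}
```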
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* filer.backup: verify chunk range in destinationMatchesReference fast path
The allocation-free fast path matched a destination chunk to its reference
by SourceFileId alone. That is correct today only because replicateOneChunk
copies the source chunk's Offset/Size verbatim, so SourceFileId identity
implies an identical range — an invariant that lives in another file with no
guard linking the two. If replication ever re-chunks (split/coalesce), a
chunk with the right SourceFileId but a different range would fast-path as a
full match and skip a needed repair (a false positive in the very class this
change otherwise prevents).
Compare Offset/Size alongside SourceFileId so the fast path is self-contained
and can only be more conservative (a range mismatch falls through to the
precise coversReference/hasStaleBackupChunk checks). Add tests for a shifted
offset and a larger size at matching identity.
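A sketch of the tightened positional comparison, assuming a simplified chunk struct (hypothetical; the real chunk type has more fields). It shows why a matching identity with a different range now falls through to the precise checks.

```go
package main

import "fmt"

// chunk is a simplified stand-in: FileId on source chunks, SourceFileId on
// backup-written destination chunks.
type chunk struct {
	FileId       string
	SourceFileId string
	Offset, Size int64
}

// destinationMatchesReference is a positional, allocation-free fast path.
// It compares the range as well as the identity, so it can only be more
// conservative than an identity-only match.
func destinationMatchesReference(dest, ref []chunk) bool {
	if len(dest) != len(ref) {
		return false
	}
	for i := range ref {
		d, r := dest[i], ref[i]
		if d.SourceFileId != r.FileId || d.Offset != r.Offset || d.Size != r.Size {
			return false
		}
	}
	return true
}

func main() {
	ref := []chunk{{FileId: "a", Offset: 0, Size: 10}}
	same := []chunk{{SourceFileId: "a", Offset: 0, Size: 10}}
	shifted := []chunk{{SourceFileId: "a", Offset: 4, Size: 10}}
	fmt.Println(destinationMatchesReference(same, ref), destinationMatchesReference(shifted, ref))
}
```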
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(filer): apply -filer.disk default to metadata log assigns
Metadata event log writes call operation.Assign directly and used only
FilerConf path rule DiskType. When filer.conf rules were missing or
unmatched, the master received an empty DiskType and grew volumes on the
built-in hdd layout.
Mirror resolveAssignStorageOption: wire FilerOption.DiskType into the
Filer, fall back when the matched path rule has no disk type, and return
the matched rule from resolveMetadataLogAssignDiskType to avoid duplicate
MatchStorageRule lookups.
Co-authored-by: Cursor <cursoragent@cursor.com>
* mini: fall back to -volume.disk for filer default disk type
weed server copies -volume.disk into the filer disk default when
-filer.disk is unset; weed mini did not, so metadata-log assigns sent
an empty disk type on clusters that only tag volumes (e.g. hot/warm).
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
The tyler-smith/go-bip39 repository was deleted from GitHub, so go mod
download fails for anyone resolving it directly (GOPROXY=direct). It
only reaches us transitively through rclone's internxt backend, which
calls IsMnemonicValid and NewSeed. Point it at cosmos/go-bip39, an
API-compatible and maintained fork.
handleCreateTable used a type assertion that fails through WithFilerClient's
'all filers failed' wrap, so a concurrent create that the pre-check missed
fell through instead of returning the existing table. Use errors.As.
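The difference can be shown in isolation. alreadyExistsError and wrapAllFailed are hypothetical stand-ins, and the sketch assumes the wrap uses %w so the original error stays reachable through the chain.

```go
package main

import (
	"errors"
	"fmt"
)

// alreadyExistsError is a hypothetical typed error for a duplicate create.
type alreadyExistsError struct{ name string }

func (e *alreadyExistsError) Error() string { return "table already exists: " + e.name }

// wrapAllFailed imitates a client helper that wraps the last error with %w.
func wrapAllFailed(err error) error {
	return fmt.Errorf("all filers failed: %w", err)
}

// isAlreadyExists walks the wrap chain instead of inspecting only the outer error.
func isAlreadyExists(err error) bool {
	var target *alreadyExistsError
	return errors.As(err, &target)
}

func main() {
	err := wrapAllFailed(&alreadyExistsError{name: "t1"})
	_, direct := err.(*alreadyExistsError) // the type assertion sees only the wrapper
	fmt.Println(direct, isAlreadyExists(err))
	// prints "false true"
}
```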
* fix(shell): correct volume.list -writable filter unit and comparison
* chore(shell): fix typo in EC shard helper param names
* fix(shell): use exact match for volume.balance -racks/-nodes filter
The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.
Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.
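A sketch of the exact-match filter. splitCSVSet is named in this log, but its body here is a guess; the selected helper and the whitespace trimming are assumptions. An empty set is treated as "no filter", checked with len().

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVSet parses a comma-separated flag value into a set of exact ids.
// Empty items are skipped; trimming whitespace is an assumption of this sketch.
func splitCSVSet(csv string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, item := range strings.Split(csv, ",") {
		if item = strings.TrimSpace(item); item != "" {
			set[item] = struct{}{}
		}
	}
	return set
}

// selected reports whether id passes the filter; an empty set selects everything.
func selected(set map[string]struct{}, id string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[id]
	return ok
}

func main() {
	flagValue := "rack10,rack2"
	// Old behaviour: a substring test wrongly accepts rack1.
	fmt.Println(strings.Contains(flagValue, "rack1"))
	// New behaviour: only exact ids match.
	fmt.Println(selected(splitCSVSet(flagValue), "rack1"))
}
```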
* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
* fix: avoid reading upload body when writing JSON errors
* s3tables: add RenameTable operation
* iceberg: support table rename
* iceberg: test table rename
* s3tables: keep table data in place on rename
Rename is catalog-only: drop the source's catalog xattrs in place instead of recursively deleting its directory, which wiped the metadata.json and data files the renamed destination still points at. Treat a missing table-metadata xattr as NoSuchTable in GetTable so the soft-deleted source name stops resolving.
* s3tables: test rename preserves data
Make the in-memory filer honor recursive data deletion and seed the source table's metadata/ and data/ children, then assert a rename leaves them intact, the source name resolves to NoSuchTable, and the destination resolves to the preserved location.
* iceberg: map rename errors through wrapped manager error
* s3tables: authorize rename destination namespace
Rename moved a table into the destination namespace after only checking the source, letting a source-authorized caller place tables in namespaces they don't control. Require CreateTable on the destination namespace and bucket before writing.
* s3tables: purge renamed table data on drop
* s3tables: test table data dir derivation
s3: don't write 503 to a disconnected client during remote cache wait
When the remote-only cache poll returns without chunks, re-check the
request context before emitting 503 + Retry-After. A client that
disconnected during the wait surfaces as context.Canceled, which the
caller already handles silently; writing to the closed connection only
produced broken-pipe log noise.
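A sketch of the re-check, using only net/http and httptest. respondCacheMiss and simulate are hypothetical names, and the Retry-After value is an assumption; the point is that nothing is written once the request context is done.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// respondCacheMiss is the tail of a hypothetical handler: the cache poll
// returned without chunks. It re-checks the request context before replying.
func respondCacheMiss(w http.ResponseWriter, r *http.Request) error {
	if err := r.Context().Err(); err != nil {
		return err // client is gone; let the caller handle it silently
	}
	w.Header().Set("Retry-After", "5") // illustrative value
	w.WriteHeader(http.StatusServiceUnavailable)
	return nil
}

// simulate runs the handler tail against a recorder. The recorder's code
// stays at its default of 200 when nothing is written.
func simulate(disconnected bool) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if disconnected {
		cancel()
	}
	req := httptest.NewRequest(http.MethodGet, "/bucket/key", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	err := respondCacheMiss(rec, req)
	return rec.Code, err
}

func main() {
	fmt.Println(simulate(false))
	fmt.Println(simulate(true))
}
```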
* s3tables: tag table entries and exclude views from table listings
* s3tables: add view CRUD operations
* iceberg: support view create, load, exists, drop, and list
* iceberg: support view update
* iceberg: test view error classification and metadata round-trip
* iceberg: pre-check existence and write view metadata only after create
* iceberg: map view namespace-not-found to 404
* iceberg: test view create namespace-404 and duplicate no-clobber
* s3tables: tag view metadata and entry type atomically
CreateView wrote ExtendedKeyMetadata and ExtendedKeyEntryType in two
UpdateEntry calls, so a partial failure could leave a view directory
untagged. Add setExtendedAttributes to set both in one UpdateEntry.
* iceberg: roll back view registration when metadata write fails
The metadata file is written after the catalog registers the view. If
that write fails, drop the just-created view so it doesn't linger
pointing at a missing metadata.json. Reuse the DeleteView path via a
shared dropView helper.
* iceberg: support multi-table transaction commit
Add handleCommitTransaction for POST /v1/transactions/commit. Validation
is atomic across all table-changes (resolve, load, evaluate every
requirement before any write); metadata writes and pointer flips are
best-effort with rollback, so this is not crash-atomic.
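The two phases can be sketched as follows, with an in-memory map standing in for the catalog. tableChange, catalog, commitTransaction and errWriteFailed are hypothetical; requirement evaluation is reduced to a boolean. As the message notes, this kind of rollback is best-effort and not crash-atomic.

```go
package main

import (
	"errors"
	"fmt"
)

var errWriteFailed = errors.New("metadata write failed")

// tableChange is a simplified table-change; requirementOK stands in for the
// result of evaluating its requirements.
type tableChange struct {
	name          string
	requirementOK bool
	newPointer    string
}

// catalog maps a table name to its current metadata pointer.
type catalog map[string]string

// commitTransaction validates every change before any write, then applies
// the writes one by one and restores prior pointers if one fails.
func commitTransaction(cat catalog, changes []tableChange, write func(name, pointer string) error) error {
	// Phase 1: validation is all-or-nothing.
	for _, c := range changes {
		if _, ok := cat[c.name]; !ok {
			return fmt.Errorf("table %q not found", c.name)
		}
		if !c.requirementOK {
			return fmt.Errorf("requirement failed for %q", c.name)
		}
	}
	// Phase 2: best-effort writes with rollback.
	prior := make(map[string]string, len(changes))
	for _, c := range changes {
		prev := cat[c.name]
		if err := write(c.name, c.newPointer); err != nil {
			for name, ptr := range prior {
				cat[name] = ptr
			}
			return fmt.Errorf("commit %q: %w", c.name, err)
		}
		if _, seen := prior[c.name]; !seen {
			prior[c.name] = prev
		}
		cat[c.name] = c.newPointer
	}
	return nil
}

func main() {
	cat := catalog{"orders": "v1", "items": "v1"}
	changes := []tableChange{
		{name: "orders", requirementOK: true, newPointer: "v2"},
		{name: "items", requirementOK: true, newPointer: "v2"},
	}
	err := commitTransaction(cat, changes, func(name, pointer string) error {
		if name == "items" {
			return errWriteFailed
		}
		return nil
	})
	fmt.Println(err, cat["orders"])
}
```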
* iceberg: route transactions/commit with and without prefix
* iceberg: test transaction commit request decoding
* iceberg: restore full prior table state on transaction rollback
* iceberg: test transaction rollback restores full prior table state
* iceberg: only clean up metadata for rolled-back tables
* s3tables: add RegisterTable op
* iceberg: support table register
* iceberg: test register table
* iceberg: parse engine-written metadata version from location
* iceberg: test metadata version parsing for both filename forms
* iceberg: map register errors through wrapped manager error
* iceberg: validate register metadata-location bucket and reject traversal
* iceberg: log register metadata load failure
* filer: treat a directory carrying object data as an S3 key object
A file promoted to a directory by a child write keeps its chunks, inline
content, or remote-tiered entry. Recognize that as a directory key object,
not only when a Mime is set, so the object still lists, demotes on delete,
and is not reclaimed by cleanup, because it is still an object.
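The widened predicate can be sketched like this. The entry struct and its field names are simplified assumptions, not the filer's real entry type.

```go
package main

import "fmt"

// entry is a simplified stand-in for a filer entry.
type entry struct {
	IsDirectory bool
	Mime        string
	ChunkCount  int    // number of stored chunks
	Content     []byte // inline content
	RemoteTier  bool   // true when the entry points at remote-tiered data
}

// isDirectoryKeyObject reports whether a directory still carries object data
// and therefore must be treated as an S3 key object.
func isDirectoryKeyObject(e entry) bool {
	if !e.IsDirectory {
		return false
	}
	return e.Mime != "" || e.ChunkCount > 0 || len(e.Content) > 0 || e.RemoteTier
}

func main() {
	promoted := entry{IsDirectory: true, ChunkCount: 2} // file promoted by a child write
	empty := entry{IsDirectory: true}
	fmt.Println(isDirectoryKeyObject(promoted), isDirectoryKeyObject(empty))
}
```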
* filer: keep the empty-folder cleaner from reclaiming a promoted object
The cleaner skips directory key objects, but its check only looked at the
Mime. Mirror the chunks/content/remote check so a file promoted to a
directory is not deleted once its children are gone.
* s3: serve ranged GET for a directory that carries object data
Reject only zero-size directories so a file promoted to a directory streams
range requests instead of returning 404, while empty directories still 404.
* s3: return HEAD metadata for a directory that carries object data
HEAD now 404s a directory only when it has no data, so a promoted object is
retrievable while empty/implicit directories still fall back to LIST.
* feat: add collection.mark shell command
Add collection.mark to mark all existing normal volume replicas in a collection as readonly or writable. The command runs in preview mode by default and requires -apply to execute changes. It reuses existing volume mark RPCs, supports default collection aliases, skips EC shards, and adds unit tests for option parsing and target collection logic.
* Revert "feat: add collection.mark shell command"
This reverts commit 50c2bbf94c.
* feat: support marking volumes by collection
Add a -collection option to volume.mark so operators can mark every normal volume replica in a collection using existing topology data and volume mark RPCs.
The change keeps the single-volume path unchanged and adds tests for collection target selection, EC shard exclusion, and argument validation.
Co-authored-by: Codex <noreply@openai.com>
* volume.mark: reuse eachDataNode for collection traversal
* volume.mark: continue past per-volume failures and report progress
Collection marking aborted on the first failed RPC, leaving the
collection half-marked with no record of which volumes succeeded.
Mark every reachable volume, print per-volume progress to the writer,
and return an aggregated error naming the failures.
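A sketch of the continue-and-aggregate loop, assuming Go 1.20+ for errors.Join. markAll, lineLog and errUnreachable are hypothetical; the mark callback stands in for the per-volume RPC, and the progress wording is illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

var errUnreachable = errors.New("volume server unreachable")

// lineLog is a tiny in-memory writer that records each progress line.
type lineLog []string

func (l *lineLog) Write(p []byte) (int, error) {
	*l = append(*l, string(p))
	return len(p), nil
}

// markAll marks every volume, reports per-volume progress to w, and returns
// one aggregated error naming the failures (nil when all succeed).
func markAll(w io.Writer, volumeIDs []uint32, mark func(id uint32) error) error {
	var errs []error
	for _, id := range volumeIDs {
		if err := mark(id); err != nil {
			fmt.Fprintf(w, "volume %d: failed: %v\n", id, err)
			errs = append(errs, fmt.Errorf("volume %d: %w", id, err))
			continue // keep going so the rest of the collection is still marked
		}
		fmt.Fprintf(w, "volume %d: marked\n", id)
	}
	return errors.Join(errs...)
}

func main() {
	var progress lineLog
	err := markAll(&progress, []uint32{1, 2, 3}, func(id uint32) error {
		if id == 2 {
			return errUnreachable
		}
		return nil
	})
	fmt.Print(progress[0], progress[1], progress[2])
	fmt.Println(err)
}
```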
* volume.mark: let -collection _default target the unnamed collection
Other volume commands use the _default sentinel to match volumes with
no named collection; volume.mark could not reach them at all. Map
_default to the empty collection name in the filter.
---------
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>