seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-09 18:32:43 +00:00

Author	SHA1	Message	Date
github-actions[bot]	c06a2dca87	4.37 4.37	2026-06-29 06:45:55 +00:00
Chris Lu	5797fb24ec	s3: support AWS object form for bucket policy Principal, add NotPrincipal (#10125 ) * s3: support AWS object form for bucket policy Principal, add NotPrincipal Bucket policy statements only accepted a bare string or array of strings for the Principal element, so the AWS-documented object form was rejected: "Principal": { "AWS": "arn:aws:iam::123456789012:root" } "Principal": { "AWS": ["arn:...", "999999999999"] } Add a PolicyPrincipal type that parses the bare string, the bare array (retained for backward compatibility), and the object form keyed by AWS, Service, Federated or CanonicalUser (each value a string or array). All keyed values are flattened for principal matching, and the original JSON is preserved so PutBucketPolicy/GetBucketPolicy returns the exact shape submitted - keeping infrastructure-as-code tools (Terraform, Ansible) idempotent. Also add NotPrincipal support (a statement applies to every principal except the ones named), compiled and evaluated in both policy evaluators, and reject statements that specify both Principal and NotPrincipal. * s3: address review - validate principal object form, honor dynamic NotPrincipal - Reject unsupported Principal object keys (only AWS/Service/Federated/ CanonicalUser) and empty values, so a form like {"AWS":[]} no longer compiles to zero matchers and silently relies on the match-all fallback. - Detect both Principal and NotPrincipal by field presence, not by flattened length, so a present-but-empty field is still rejected. - Honor dynamic (policy-variable) NotPrincipal/Principal patterns in the compiled evaluator; previously a NotPrincipal made only of variables was treated as absent and its exclusion bypassed. - Add regression tests for the object-form validation and dynamic NotPrincipal.	2026-06-27 22:36:26 -07:00
Rushikesh Deshpande	d0db94c34a	feat(metrics): Add EC rebuild/reconstruct Prometheus metrics (#10124 ) * Review comment removed unnecessary success and failure count * fix: use Gather.Gather() with seeded counter for EC rebuild registration test - Restore Gather.Gather() to verify MustRegister calls as requested in review - Seed VolumeServerECRebuildCounter before gathering because CounterVec only appears after at least one label value is observed - Use correct fully-qualified metric names (SeaweedFS_volumeServer_) fix: remove preflight checkEcVolumeStatus failure from ec_rebuild_total counter ec_rebuild_total should only reflect actual rebuild execution failures (from RebuildEcFiles / RebuildEcxFile), not scan/precheck failures in the volume status loop. The error is still returned to the caller; only the misleading counter increment was removed. * Review comment removed unnecessary observe * label EC rebuild duration histogram by result Without a result label, fast failures pull down the success-latency quantiles shown on the EC Rebuild Duration panel. Make the histogram a HistogramVec keyed by result, record success/failure through one recordEcRebuild helper, and split the Grafana quantiles by (le, result). * reset EC rebuild metric vecs in registration test The HistogramVec needs a child before Gather emits it, so the test must observe once; reset both vecs in cleanup so that sample doesn't leak into other tests. --------- Co-authored-by: Ubuntu User <ubuntu@example.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-27 22:01:36 -07:00
Chris Lu	57ffef8543	fix(admin): skip task state files with no task data on load An empty or truncated tasks/*.pb file unmarshals into a TaskStateFile with a nil Task, and protobufToMaintenanceTask dereferenced it immediately, panicking the whole admin process on startup. Guard the nil case so the loader logs a warning and skips the bad file.	2026-06-26 17:36:42 -07:00
Chris Lu	f643893891	fix(master): shed assign load when volume growth is already in flight (#10121 ) Under a herd of concurrent assigns with no writable volume, Assign spun PickForWrite for the full 10s timeout, pinning a goroutine per request and starving the master of the cycles it needs to process growth and answer heartbeats. When growth is the relevant remedy and already in flight, stop spinning: if free space exists, shed with a fast retryable error so clients back off and retry once growth lands; if the cluster is out of space, fail fast with the real out-of-space error instead of masking it as retryable. The gRPC shed uses ResourceExhausted, not Unavailable: operation.Assign retries it, but the client connection layer doesn't treat it as a dead channel, so a per-request shed across a herd doesn't tear down the shared master connection and cancel every other in-flight assign. The HTTP dirAssignHandler sheds with 503 + Retry-After.	2026-06-26 14:23:40 -07:00
jk2lx	81ed379884	volume server: route VolumeMarkReadonly to raft leader (#10120 ) * volume server: route VolumeMarkReadonly to raft leader After a master raft election, volume servers may still heartbeat a follower while admin paths such as weed shell volume.mark call notifyMasterVolumeReadonly via vs.GetMaster(). Followers reject VolumeMarkReadonly with NotLeader, which breaks tiering and other mark-readonly workflows until the heartbeat loop reconnects. Resolve the leader through GetMasterConfiguration on configured -master peers (same Leader field filer/master clients already use) before calling VolumeMarkReadonly. When the leader differs from the heartbeat peer, update currentMaster so the heartbeat loop converges faster. Adds operation.LookupRaftLeaderMaster with unit tests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: address review feedback on volume.mark raft leader routing Do not update currentMaster during leader lookup — heartbeat owns that field and uses stream GetLeader() to reconnect. Try the heartbeat peer first and only resolve the raft leader after a NotLeader rejection. Add ctx.Err() early exit and quieter logging for context cancellation. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(operation): thread the lookup timeout ctx into connection invalidation The 5s timeout drove only the RPC; WithMasterServerClient saw the unbounded outer ctx, so a self-inflicted timeout (slow GetMasterConfiguration during an election) was treated as a stale channel and tore down the shared master connection. Pass the timeout ctx into the helper so its own expiry leaves ctx.Err() set and spares the connection. --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-26 14:22:57 -07:00
Chris Lu	7c3c5ed2a4	fix(filer.sync.verify): sort listings client-side before merge (#10117 ) * fix(filer.sync.verify): sort listings client-side before merge The merge walks both filers' directory listings in lockstep and needs them in the same byte order. A filer before 4.32 with a locale SQL collation lists case-insensitively while a 4.32+ peer lists byte-ordered, so comparing two such clusters returns the same names in a different order and the merge desyncs into spurious MISSING / ONLY_IN_B. Buffer and sort each directory client-side so both sides agree on order regardless of filer version or store backend. Trades the streaming source's O(buffer) memory for O(directory) per side, fine for a one-shot verify CLI; both sides still load concurrently. Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF * fix(filer.sync.verify): surface listing errors before merging A listing that fails mid-stream leaves a partial, unsorted buffer. Now that both sides are fully buffered anyway, check each side's error right after the loads finish and before the merge, so partial entries can't emit spurious MISSING / ONLY_IN_B before the error aborts the run. Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF	2026-06-26 10:27:18 -07:00
qzhello	378f9a64ff	fix: apply collectionPattern during detection in volume.fix.replication (#10115) * fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers * fix: apply collectionPattern during detection in volume.fix.replication * use existing wildcard.MatchesWildcard for collection matching It returns a plain bool, so drop the up-front filepath.Match validation and the path/filepath import that only existed to handle its error. * trim verbose comments to terse one-liners * drop redundant per-path collection guards Detection already filters by replicas[0].info.Collection. The repair guard re-checked pickOneReplicaToCopyFrom's collection (a different replica), so a mixed-collection volume could pass detection yet be skipped in repair without decrementing the counter, spinning the -apply loop. deleteOneVolume keeps its collectionIsMismatch safety. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-26 00:48:29 -07:00
Chris Lu	f475d60fcf	mount: move directory cache state to a side map to shrink InodeEntry (152 to 32 bytes) (#10114 ) mount: move directory cache state to a side map to shrink InodeEntry The mount keeps an InodeEntry alive for every inode the kernel references. On a mount that is almost entirely regular files, each entry carried the full directory readdir-cache bookkeeping (four time.Time fields plus counters), bloating it to 152 bytes whether or not the inode was a directory. Move that state into a dirState held in a side map keyed by inode, and drop the isDirectory bool: an inode is a directory iff it has a dirState. InodeEntry is now just paths + nlookup at 32 bytes, landing in a smaller Go allocator size class; on a mount with tens of millions of cached file inodes that is several GB less resident heap. As a side effect the readdir-cache scan helpers iterate only directories instead of every inode.	2026-06-25 19:17:32 -07:00
Chris Lu	c2668fbc64	fix(volume): make tier-down crash-safe and serve from local (Rust) (#10113 ) * fix(volume): fsync .vif and downloaded tier .dat (Rust) save_volume_info wrote the .vif with a plain write and no fsync, and the tier download never synced the .dat it wrote. Either could be lost on a crash before the tier-down path acts on them. fsync both, matching the Go volume server's util.WriteFile and DownloadFile. * fix(volume): swap to local before deleting remote on tier-down (Rust) The tier-down path deleted the shared remote object before trimming the .vif, so a crash in between left the volume's .vif pointing at a deleted object. It also dropped the remote backend only on the delete path and never opened the downloaded local .dat, so reads broke until reload and a keep-remote download kept serving from the slow remote object. Trim the .vif and swap to the local .dat on both paths, bracketed by directory fsyncs, before removing the remote object; gate only the object removal on keep_remote_dat_file. Matches the Go volume server's crash-safe ordering.	2026-06-25 12:29:21 -07:00
Chris Lu	66620a1ab8	fix(volume): serve reads from remote after tier upload (Rust) (#10112) After VolumeTierMoveDatToRemote uploaded the .dat, the volume closed its local backend but never opened the remote one, leaving both dat_file and remote_dat_file empty. The needle read path has no lazy reopen, so reads returned "dat file not open" until the volume reloaded. Switch to the remote backend right after saving the .vif, the same as the Go volume server's LoadRemoteFile, so the volume keeps serving from remote storage immediately after tiering.	2026-06-25 10:55:52 -07:00
Chris Lu	2c2df751f5	Perf CI: benchmark the Rust volume server and report memory usage (#10111 ) * ci: add per-process memory sampler for perf jobs Samples VmRSS once a second into a CSV and records peak VmHWM per process on stop. Linux only; reads /proc/<pid>/status. * ci: run perf benchmarks on the Rust volume server and report memory Matrix the throughput and S3 jobs over go/rust volume servers, using a standalone master (plus filer for S3) and swapping only the volume binary so the two are directly comparable. Sample peak RSS in every job and surface it per impl in the run summary. * ci: harden mem sampler arg handling and peak fallback Guard against missing args under set -u, and fall back to the max RSS sampled when a process exits before VmHWM can be read.	2026-06-25 10:52:23 -07:00
Chris Lu	2efc0e1656	ec: recover EC shards whose .ecx index lives only on a peer server (#10108 ) * ec: recover EC shards whose .ecx index lives only on a peer server A volume server that boots with EC shard files on disk but no .ecx index on any local disk cannot mount the shards, so the master never learns about them. ec.rebuild works off master-registered shards, so it sees the volume as short and gives up even though the shard data is intact. Add an operator-triggered recovery: VolumeEcShardsMount gains a recover_missing_index flag that makes the volume server fetch the missing .ecx (plus .ecj/.vif) from a peer holding it and mount the on-disk shards. ec.rebuild runs this across the cluster before planning, so orphaned shards register and the rebuild sees the true shard set. .ecx is an immutable encode-time index, identical on every holder. .ecj is a per-holder deletion journal that differs across holders, so the recovered node adopts the source peer's deletion view, like a balanced or rebuilt shard does. * ec: mirror missing-index recovery into the Rust volume server Port the #10104 recovery to seaweed-volume so the Rust volume server self-heals the same layout: EC shards on disk with the .ecx index only on a peer. Adds collect_ec_volumes_missing_index / mount_recovered_ec_shards to the store, recover_missing_ec_indexes (master LookupEcVolume + peer CopyFile fetch + mount) to the server, and the recover_missing_index flag on VolumeEcShardsMount. .ecx is the immutable encode-time index, identical on every holder. .ecj is a per-holder deletion journal, so the recovered node adopts the source peer's deletion view, matching the Go path.	2026-06-25 10:38:14 -07:00
adri	130a5dffc3	fix (Volume [Rust]): stream copy_file and volume_incremental_copy instead of buffering the whole file in memory (#10110) * fix(volume): stream copy_file from disk instead of buffering whole file copy_file pushed every 2MB chunk into a Vec and only then returned tokio_stream::iter(results), so serving a near-limit volume as a copy source (e.g. during volume.fix.replication) held the entire .dat resident and could OOM the process. Stream chunks through a bounded mpsc channel from a spawn_blocking reader instead; caps memory at ~16MB per transfer with backpressure. * fix(volume): stream volume_incremental_copy from disk instead of buffering Same buffering pattern as copy_file: every 2MB chunk was pushed into a Vec and only then returned via tokio_stream::iter, holding the entire delta resident. Stream the byte range from an owned file handle through a bounded mpsc channel, mirroring the copy_file fix. * test(volume): cover streaming copy_file and volume_incremental_copy Adds a multi-chunk .dat fixture and tests asserting both handlers stream in 2MB chunks (multiple messages), reassemble byte-for-byte, carry modified_ts_ns only on the first copy_file message, and honor stop_offset. * address review: use u64 byte counters; stream local incremental copy without holding the store lock - copy_file/volume_incremental_copy: track remaining bytes and offsets as u64 instead of casting uint64 stop_offset/dat_size through i64 (CodeRabbit). - volume_incremental_copy: for local volumes open the .dat and stream directly with no lock held; only remote/tiered volumes take the per-chunk read_dat_slice path, so a remote S3 read is never performed while holding the store read lock (Gemini). * volume (Rust): stream tiered incremental copy off the store lock, open .dat under it Capture the reader for volume_incremental_copy while the volume lookup is still under the store read lock: an open File for local volumes, a cloned remote backend handle for tiered ones. Then drop the lock and stream with none held. Opening under the lock pins the reader to the volume that exists now, so a concurrent delete/recreate can't stream from the wrong file, and a slow S3 fetch for a tiered .dat no longer blocks store writers (the remote path previously re-took the store lock per chunk). Use a non-uniform copy-test payload so chunk reassembly catches duplicated or reordered chunks a repeated byte would hide. * volume (Rust): return empty when incremental-copy start offset is past the .dat A corrupt needle index could locate an offset beyond the captured .dat size, underflowing the dat_size - start_offset subtraction (panic in debug, wrap in release). Guard it up front like the other empty-delta early returns. --------- Co-authored-by: adri <adri@digitalunited.net> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-25 10:25:42 -07:00
Chris Lu	c01cea8786	docker release: run all platform jobs in one wave, cache rocksdb compile Drop max-parallel so the 13 per-platform builds run together instead of two waves of 8 (rocksdb was queuing behind the cap and starting ~8 min late). Keep cache-to mode=max for rocksdb: its RocksDB static_lib compile is sha-independent, so it caches across releases and stops being the ~16-min long-pole that gates the merge fan-in. go-build variants stay mode=min.	2026-06-25 01:13:32 -07:00
Chris Lu	3f68b19500	docker release: per-platform builds on native runners, drop mode=max cache (#10109 ) docker release: build per-platform on native runners, drop mode=max cache The build job built every platform of a variant on one runner, so 2-4 Go cross-compiles fought over a single 2-vCPU box and arm64 ran in an emulated context. Split the matrix to one platform per job on a native runner (amd64/386 on ubuntu-latest, arm64/arm-v7 on ubuntu-24.04-arm); only arm/v7 still needs QEMU, and only for its final apk stage. Each job pushes by digest, and a new merge job assembles the multi-arch tag with imagetools and mirrors it to Docker Hub. cache-to mode=max -> mode=min: BRANCH=sha cache-busts the heavy go-build layer every release, so writing all intermediate layers to the gha backend spent 3-11 min per variant on a cache the next release's sha can never hit.	2026-06-25 00:37:33 -07:00
jay	d2795de186	fix(admin): volume TTL in dashboard (#10107) fix: admin dashboard ttl display Signed-off-by: jayl1e <jayl1e@outlook.com>	2026-06-24 23:42:10 -07:00
Chris Lu	a88acaf061	Add performance CI (profiling, throughput, S3 read/write) (#10105 ) * test: add self-contained S3 read/write load tool Concurrent PUT/GET against the S3 gateway, reporting requests/sec, transfer rate, and latency percentiles. Built on the aws-sdk-go-v2 client the S3 tests already use, so no extra benchmark binary is needed. * ci: add performance workflow Three parallel jobs: cpu/heap pprof of the server under write load, native throughput via weed benchmark plus the Go micro-benchmarks, and an S3 read/write benchmark against the gateway. Runs on push to master and manual dispatch with tunable duration, object count, size, and concurrency.	2026-06-24 22:44:03 -07:00
github-actions[bot]	d0b90d29eb	4.36 4.36	2026-06-25 05:09:40 +00:00
Chris Lu	d65ed3b557	add release version-bump workflow	2026-06-24 22:08:06 -07:00
Chris Lu	3b9e196e5f	sts: enforce session-policy explicit deny during role chaining (#10103 ) * sts: enforce session-policy explicit deny during role chaining A chained AssumeRole caller authenticates with an STS session token whose inline session policy can explicitly deny sts:AssumeRole. The deny check only evaluated the caller's named policies, so such a session could still chain into any role its trust policy admits. Validate the session token in the deny check and honor an explicit Deny in the inline session policy too. * test(sts): integration coverage for AssumeRole authorization Add an end-to-end AssumeRole authorization test (real weed mini + boto3): a non-admin caller assumes a role its trust policy admits, an explicit identity-side deny is blocked, and a session policy's explicit deny blocks role chaining. * sts: skip OIDC tokens and reject revoked sessions in the chaining deny check Review follow-ups on the session-policy deny check: - Guard session validation with !isOIDCToken so a bearer token our STS service cannot validate does not error into a false deny. - Reject a revoked session before evaluating its policy, restoring the revocation enforcement the AssumeRole path lost when it stopped routing through IsActionAllowed.	2026-06-24 21:38:21 -07:00
Chris Lu	88a4a939aa	fix(sts): authorize AssumeRole by the role's trust policy (#10097 ) * fix(sts): authorize AssumeRole by the role's trust policy The role's trust policy already declares who may assume it, but the caller also had to pass an identity-side sts:AssumeRole check that only the Admin action could satisfy — legacy static identities have no way to express sts:AssumeRole on a role. So assuming any role required a full admin identity. Drop the redundant check and let the trust policy be the authority; scope it to specific principals to restrict who can assume. * sts: resolve caller principal ARN for the trust-policy check A legacy static identity can reach AssumeRole without a PrincipalArn set; passing the empty value would miss a trust policy that names a concrete principal. Resolve it to the canonical user ARN, sharing the logic GetCallerIdentity already used inline. * sts: enforce explicit identity-side deny for AssumeRole Authorizing a named role by its trust policy alone dropped identity-side evaluation entirely, so a caller whose attached policy explicitly denies sts:AssumeRole could still assume any role the trust policy admits. Re-check the caller's policies through the IAM manager for an explicit deny (deny-always-wins) without requiring an allow; the trust policy stays the allow authority.	2026-06-24 20:14:26 -07:00
sshhan	a1fff50935	fix(postgres): prevent uint32 underflow & OOM in message parsing (#10099 ) * fix(postgres): prevent uint32 underflow & OOM in message parsing * postgres: drop redundant startup guard, use maxStartupMessageSize const The msgTotalLen < 8 check already guarantees msgLength >= 4, so the extra msgLength < 4 guard before reading the protocol version was unreachable. Point the startup size limit at maxStartupMessageSize instead of a literal. * postgres: trim query terminator safely, cap pre-auth payloads Use strings.TrimSuffix for the simple-query null terminator so a non-null-terminated body isn't silently shortened, matching the auth handlers. Bound password/MD5 reads with a dedicated maxAuthMessageSize (10 KiB) instead of the 100 MiB maxMessageSize, since these payloads are read before authentication. --------- Co-authored-by: shangshuhan <shangshuhan@cmict.chinamobile.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-24 20:05:43 -07:00
Chris Lu	0f1ec8983d	mount: don't fail close() on a benign FUSE interrupt (#10102) A FUSE interrupt is not a process kill. Go's async preemption (SIGURG) makes a close() under load emit an interrupt on nearly every flush, so deriving the metadata-flush context from the FUSE cancel channel turned healthy concurrent close()s into EIO: the interrupt cancelled the in-flight CreateEntry, which surfaced as "input/output error". Bound the flush with a deadline instead. A healthy CreateEntry finishes in well under a second, so the deadline only fires against a genuinely stuck filer -- still keeping close() from hanging forever -- while benign preemption no longer aborts a good flush.	2026-06-24 19:54:03 -07:00
Chris Lu	95427b5573	security: add BearerPrefix constant for Authorization headers (#10101) Introduce security.BearerPrefix ("Bearer ", RFC 6750) and use it everywhere an "Authorization: Bearer <token>" header is constructed, replacing the scattered "BEARER "/"Bearer " string literals. SeaweedFS matches the scheme case-insensitively when parsing (security.GetJwt), so behavior is unchanged; this removes the magic string and settles the casing on the standard form. The parser's upper-case comparison stays as is on purpose.	2026-06-24 19:36:42 -07:00
Chris Lu	4d3e5d94a9	filer: mint volume read JWT when proxying chunk reads (#10100) The /?proxyChunkId= endpoint forwards the caller's headers to the volume server but never mints a read token, so proxied chunk reads return 401 once jwt.signing.read.key is configured. Generate a fileId-scoped volume token the same way the direct filer read path does, which fixes filer.sync, filer.backup, filerProxy mounts, the MQ broker and the upload gateway in one place.	2026-06-24 19:21:57 -07:00
dependabot[bot]	7c9f61d4dc	build(deps): bump com.fasterxml.jackson.core:jackson-databind from 2.18.6 to 2.22.0 in /test/java/spark (#10094 ) * build(deps): bump com.fasterxml.jackson.core:jackson-databind Bumps [com.fasterxml.jackson.core:jackson-databind](https://github.com/FasterXML/jackson) from 2.18.6 to 2.22.0. - [Commits](https://github.com/FasterXML/jackson/commits) --- updated-dependencies: - dependency-name: com.fasterxml.jackson.core:jackson-databind dependency-version: 2.22.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * build(deps): pin jackson-annotations to its own 2.22 version jackson-annotations dropped the patch digit in 2.20 and releases on its own line, so 2.22.0 does not exist. Sharing jackson.version broke dependency resolution; give it a dedicated property. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-24 19:12:48 -07:00
Chris Lu	96d2d13efe	s3: replicate by fanning out from the gateway to every holder (#10078 ) * s3: replicate by fanning out from the gateway to every holder The S3 gateway uploaded each chunk to one volume server, which then relayed the copies to the other replica holders. The gateway now uploads each chunk to every holder in parallel (type=replicate), removing the primary volume server's receive-then-resend relay. AssignVolume returns every replica holder (new repeated Location replicas, forwarded from the master assign), the s3api captures them, and the chunked uploader fans out whenever a chunk has more than one holder. Cipher uploads keep the server-driven path since per-call encryption would diverge the replicas. * s3: cancel sibling replica uploads on the first failure * s3: trim replica fan-out comments * s3: roll back successful fan-out chunk copies when a holder fails A failed fan-out records no FileChunk, so copies that landed on the holders that finished before the cancel were leaked as orphans the caller could not see. Track the holders that succeeded and delete the needle from each (type=replicate, local-only) on failure, leaving nothing behind.	2026-06-24 16:31:58 -07:00
os-pradipbabar	d1b1338558	Fix stale cache fallback for empty volume locations in wdclient (#10081) fix(wdclient): prevent stale cache fallback for empty volume locations ## Problem During Kubernetes pod restarts, volume servers temporarily disconnect and their locations are removed from vidMap. The deleteLocation function leaves an empty array [] in vid2Locations map instead of removing the key entirely. GetLocations() was checking 'if found && len(locations) > 0', which would fail for empty arrays and fall back to the cache chain, returning STALE locations from before the restart. This caused S3 gateway to try connecting to old pod IPs that no longer exist, resulting in connection timeouts and hanging registry sync jobs. Example timeline: 1. Volume pod at 10.131.1.28:8081 registers volumes 10,12 2. S3 gateway caches: vid2Locations[10] = [10.131.1.28:8081] 3. Pod restarts, gets new IP 10.131.1.65:8081 4. Master sends delete → vid2Locations[10] = [] (empty, but key exists) 5. BUG: GetLocations(10) sees found=true, len=0 → falls back to cache 6. Returns stale 10.131.1.28:8081 instead of waiting for new location 7. S3 requests timeout trying to reach unreachable old IP ## Solution Distinguish between two cases: - found=true, locations=[] : Volume explicitly has no locations (e.g. restart) → Return nil, false (no fallback to cache) - found=false : Volume never seen in current map → Check cache (preserve cache benefits for unknown volumes) An empty array explicitly means 'this volume currently has no locations', which is semantically different from 'volume unknown'. Don't fall back to stale cache for explicitly empty volumes. ## Testing Added comprehensive tests: - TestGetLocationsEmptyArrayNoFallback: Verifies empty arrays don't use cache - TestGetLocationsUnknownVolumeUsesCache: Verifies unknown volumes still use cache - All existing tests pass ## Impact Fixes registry sync job hangs during SeaweedFS upgrades/restarts. S3 gateway will now correctly wait for updated volume locations instead of using stale cached IPs. Related: OutSystems.SeaWeedfs Helm chart, vega cluster incident 2026-06-24	2026-06-24 16:31:32 -07:00
Chris Lu	089acfbf36	fix(s3api): apply static config file updates on reload (#10096 ) A config-file reload (SIGHUP) routed through MergeS3ApiConfiguration, which skips identities marked static so dynamic admin/filer updates can't clobber them. That also blocked the config file itself from updating its own identities, so editing a secretKey and reloading had no effect. Thread a fromStaticFile flag from the file-load path into the merge: the authoritative file overwrites its static identities (and reapplies service accounts under them), while dynamic updates still leave them immutable. Mark the rebuilt identities static in the merge so a concurrent RemoveIdentity never observes them as removable mid-reload.	2026-06-24 16:26:35 -07:00
Chris Lu	cd828f6503	s3: propagate IAM changes from standalone weed s3 to peer pods (#10095) Standalone weed s3 created a master client and registered the receiving SeaweedS3IamCache gRPC service, but never wrapped its credential store with the propagating store. Only the filer-embedded path called SetMasterClient, so IAM mutations on one s3 pod never reached peers; they served a stale in-memory identity cache and returned InvalidAccessKeyId until restarted. Wrap the credential store with the master client when one is available, mirroring the filer path, so mutations fan out over the existing gRPC cache service.	2026-06-24 16:26:08 -07:00
Chris Lu	c15989387b	s3tables: allow hyphens in namespace and table names (#10093 ) * s3tables: allow hyphens in namespace and table names Iceberg REST clients routinely use hyphenated namespace/table names, but the S3 Tables charset (a-z, 0-9, _) rejected them with 400. Accept '-' as an interior character (names must still start, and namespaces end, with a letter or digit), making the catalog conformant for those clients. A permissive superset of the AWS S3 Tables charset. * s3tables: allow hyphens in table ARN parsing too The ARN regexes still excluded '-', so parseTableFromARN rejected ARNs with hyphenated namespace/table names and existing reject-the-hyphen tests broke. Widen the ARN patterns to match the validator, retarget those tests at a still-invalid leading-hyphen name, and cover ARN parsing with hyphens.	2026-06-24 16:24:45 -07:00
Chris Lu	1c5f8244a4	s3tables: fix create-after-rename overwriting the renamed table (#10091 ) * s3tables: purge decoupled table data without deleting the reused name path A renamed or created-over-leftover table keeps its data at a location that differs from its catalog name path. Drop now purges that data location and clears the marker, instead of recursively deleting the name path, which may still hold another table's data. * iceberg: route a table created over a leftover to a unique location When the default location is occupied by a leftover directory (data kept when another table was renamed to this name), create the new table at a unique location so it cannot overwrite that table's metadata. Common case is unchanged. * iceberg: fail table create when the leftover-path check errors A transient filer lookup error fell through as "not occupied", routing the new table back to the default path and risking the very overwrite this check guards against. Propagate the error and return 500 instead. * s3tables: assert all catalog xattrs cleared on decoupled drop Seed the full marker set so the test catches a regression that leaves the policy, tags, version, or entry-type attribute on the reused name path. * s3tables: refuse to drop a table whose data path is an ancestor Corrupt metadata can resolve the data path to the bucket or namespace root, which the bucket-scope check still admits; a recursive purge there would wipe sibling tables. Reject an ancestor data path before deleting.	2026-06-24 14:37:04 -07:00
Chris Lu	5456f9d695	mount: confirm an empty directory rebuild before caching it (#10092 ) A directory rebuild wiped the cached children, listed the filer once, and published the directory authoritatively cached over whatever came back. A transient empty listing -- a momentary list-stream glitch that ends as a clean EOF with no entries -- then stranded a populated directory cached over an empty store, hiding every file in it until some unrelated event happened to rebuild it: stat returns ENOENT and readdir returns nothing though the files are safe on the filer, and nothing re-triggers a build. Re-read the directory when the listing comes back empty before trusting it. The first re-read is immediate, since the likely transient clears on a fresh stream; later attempts space out. A genuinely empty directory still lists empty every time and caches as before, so only empty listings pay the extra read. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 14:25:23 -07:00
Chris Lu	5112da98a2	mount: skip redundant permission checks under default_permissions (#10089 ) With default_permissions (the mount default) the kernel enforces unix permission bits from the getattr/lookup attributes before it ever calls Open, Create, or Mknod. The mount was re-checking permissions in AcquireHandle and createRegularFile anyway, which duplicated the kernel's work and kept the supplementary-group lookup on the per-file hot path. Gate only the mode-bit access check on default_permissions being off, so a non-root copy does no permission work on open/create. createRegularFile still loads the parent to validate it exists, since the create RPC skips the filer-side parent check. With default_permissions off the mount remains the sole enforcer, so the full check still runs.	2026-06-24 14:24:51 -07:00
Chris Lu	ef109fe9e1	mount: don't hang close() when a writer is killed during flush (#10090 ) * operation: bound AssignVolume with a deadline AssignVolume ran on context.Background(), so when the filer is overwhelmed the RPC could block indefinitely and wedge every caller holding the connection. Give it a 30s deadline so a stuck assign fails and the caller's retry/error path runs instead of hanging forever. * mount: abort flush when the FUSE request is interrupted On close(), a killed process blocks in fuse_flush waiting for the mount to answer. doFlush ran its metadata CreateEntry on context.Background() and ignored the kernel interrupt channel, so against an overwhelmed filer the flush never completed and the process stayed in uninterruptible sleep -- making the pod un-killable. Derive a context from the FUSE cancel channel in Flush/Fsync and thread it through doFlush -> flushMetadataToFiler -> streamCreateEntry; the retry loop stops as soon as the context is cancelled. Release and the pre-rename flush keep a non-cancellable context since they must finish regardless. * operation: harden the AssignVolume timeout test Make the test double's signal send non-blocking and bound the receive with a timeout so a regression can't wedge the test instead of failing it.	2026-06-24 14:24:22 -07:00
Jaehoon Kim	a11d81b21f	fix(filer.backup): repair chunk-incomplete and stale destination entries (#10082 ) * fix(filer.backup): repair chunk-incomplete and stale destination entries filer.backup left destinations diverged while metadata advanced — chunk-incomplete (missing/gapped ranges at full attr.file_size) or holding a chunk superseded by a missed overwrite. The skip/repair decision keyed on filer.FileSize (the attr), which a truncated entry keeps full, so it never repaired. Decide from actual chunk state instead: - coversReference: range-by-range containment (scalar byte totals and attr FileSize/Md5 cannot see chunk-level gaps). - hasStaleBackupChunk: a backup-written chunk (SourceFileId) the source no longer lists; ignores out-of-band (rsync/direct) chunks. - destinationMatchesReference: allocation-free positional fast path gating the above so they run only on divergence (the in-sync path stays cheap). - A strictly-newer destination is never repaired, so an older out-of-order replay cannot roll it back. The stale signal is deferred at equal mtime (same-second versions cannot be ordered; reliable S3 sub-second ordering is a separate fix). Tests in filer_sink_test.go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * filer.backup: verify chunk range in destinationMatchesReference fast path The allocation-free fast path matched a destination chunk to its reference by SourceFileId alone. That is correct today only because replicateOneChunk copies the source chunk's Offset/Size verbatim, so SourceFileId identity implies an identical range — an invariant that lives in another file with no guard linking the two. If replication ever re-chunks (split/coalesce), a chunk with the right SourceFileId but a different range would fast-path as a full match and skip a needed repair (a false positive in the very class this change otherwise prevents). 
Compare Offset/Size alongside SourceFileId so the fast path is self-contained and can only be more conservative (a range mismatch falls through to the precise coversReference/hasStaleBackupChunk checks). Add tests for a shifted offset and a larger size at matching identity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-24 14:23:38 -07:00
jk2lx	e1f89f85f2	fix(filer): apply -filer.disk default to metadata log assigns (#10080 ) * fix(filer): apply -filer.disk default to metadata log assigns Metadata event log writes call operation.Assign directly and used only FilerConf path rule DiskType. When filer.conf rules were missing or unmatched, the master received an empty DiskType and grew volumes on the built-in hdd layout. Mirror resolveAssignStorageOption: wire FilerOption.DiskType into the Filer, fall back when the matched path rule has no disk type, and return the matched rule from resolveMetadataLogAssignDiskType to avoid duplicate MatchStorageRule lookups. Co-authored-by: Cursor <cursoragent@cursor.com> * mini: fall back to -volume.disk for filer default disk type weed server copies -volume.disk into the filer disk default when -filer.disk is unset; weed mini did not, so metadata-log assigns sent an empty disk type on clusters that only tag volumes (e.g. hot/warm). --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-24 10:47:11 -07:00
Chris Lu	d29e6ed98a	deps: replace deleted tyler-smith/go-bip39 with cosmos fork (#10088) The tyler-smith/go-bip39 repository was deleted from GitHub, so go mod download fails for anyone resolving it directly (GOPROXY=direct). It only reaches us transitively through rclone's internxt backend, which calls IsMnemonicValid and NewSeed. Point it at cosmos/go-bip39, an API-compatible and maintained fork.	2026-06-24 10:41:43 -07:00
Chris Lu	e744b5f2ee	iceberg: detect table-exists through the wrapped manager error (#10075) handleCreateTable used a type assertion that fails through WithFilerClient's 'all filers failed' wrap, so a concurrent create that the pre-check missed fell through instead of returning the existing table. Use errors.As.	2026-06-24 10:22:36 -07:00
patrick	3e2c637858	util: trim minFreeSpace values before parsing (#10083)	2026-06-24 09:03:38 -07:00
Lisandro Pin	30f2dd5040	Weed shell `ec.rebuild`: Allow targeting rebuild to specific volume IDs. (#10087)	2026-06-24 08:40:29 -07:00
qzhello	fb168e2a36	fix: avoid reading upload body when writing JSON errors (#10073 ) * fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers * fix: avoid reading upload body when writing JSON errors	2026-06-23 20:20:11 -07:00
Chris Lu	c95401b11a	iceberg: support table rename (#10068 ) * s3tables: add RenameTable operation * iceberg: support table rename * iceberg: test table rename * s3tables: keep table data in place on rename rename is catalog-only: drop the source's catalog xattrs in place instead of recursively deleting its directory, which wiped the metadata.json and data files the renamed destination still points at. treat a missing table-metadata xattr as NoSuchTable in GetTable so the soft-deleted source name stops resolving. * s3tables: test rename preserves data make the in-memory filer honor recursive data deletion and seed the source table's metadata/ and data/ children, then assert a rename leaves them intact, the source name resolves to NoSuchTable, and the destination resolves to the preserved location. * iceberg: map rename errors through wrapped manager error * s3tables: authorize rename destination namespace rename moved a table into the destination namespace after only checking the source, letting a source-authorized caller place tables in namespaces they don't control. require CreateTable on the destination namespace and bucket before writing. * s3tables: purge renamed table data on drop * s3tables: test table data dir derivation	2026-06-23 20:18:11 -07:00
Chris Lu	7abed4e517	s3: skip 503 when client disconnects during remote cache wait (#10071) s3: don't write 503 to a disconnected client during remote cache wait When the remote-only cache poll returns without chunks, re-check the request context before emitting 503 + Retry-After. A client that disconnected during the wait surfaces as context.Canceled, which the caller already handles silently; writing to the closed connection only produced broken-pipe log noise.	2026-06-23 15:31:08 -07:00
Chris Lu	0403e47ef6	iceberg: support views (#10069 ) * s3tables: tag table entries and exclude views from table listings * s3tables: add view CRUD operations * iceberg: support view create, load, exists, drop, and list * iceberg: support view update * iceberg: test view error classification and metadata round-trip * iceberg: pre-check existence and write view metadata only after create * iceberg: map view namespace-not-found to 404 * iceberg: test view create namespace-404 and duplicate no-clobber * s3tables: tag view metadata and entry type atomically CreateView wrote ExtendedKeyMetadata and ExtendedKeyEntryType in two UpdateEntry calls, so a partial failure could leave a view directory untagged. Add setExtendedAttributes to set both in one UpdateEntry. * iceberg: roll back view registration when metadata write fails The metadata file is written after the catalog registers the view. If that write fails, drop the just-created view so it doesn't linger pointing at a missing metadata.json. Reuse the DeleteView path via a shared dropView helper.	2026-06-23 15:22:31 -07:00
Chris Lu	1ca628d3e9	iceberg: support multi-table transaction commit (#10066 ) * iceberg: support multi-table transaction commit Add handleCommitTransaction for POST /v1/transactions/commit. Validation is atomic across all table-changes (resolve, load, evaluate every requirement before any write); metadata writes and pointer flips are best-effort with rollback, so this is not crash-atomic. * iceberg: route transactions/commit with and without prefix * iceberg: test transaction commit request decoding * iceberg: restore full prior table state on transaction rollback * iceberg: test transaction rollback restores full prior table state * iceberg: only clean up metadata for rolled-back tables	2026-06-23 14:08:03 -07:00
Chris Lu	628ce57625	iceberg: support table register (#10067 ) * s3tables: add RegisterTable op * iceberg: support table register * iceberg: test register table * iceberg: parse engine-written metadata version from location * iceberg: test metadata version parsing for both filename forms * iceberg: map register errors through wrapped manager error * iceberg: validate register metadata-location bucket and reject traversal * iceberg: log register metadata load failure	2026-06-23 14:07:13 -07:00
Chris Lu	63f2f0bef5	s3: keep a file promoted to a directory retrievable as an object (#10070 ) * filer: treat a directory carrying object data as an S3 key object A file promoted to a directory by a child write keeps its chunks, inline content, or remote-tiered entry. Recognize that as a directory key object, not only when a Mime is set, so the object still lists, demotes on delete, and is not reclaimed by cleanup like the object it still is. * filer: keep the empty-folder cleaner from reclaiming a promoted object The cleaner skips directory key objects, but its check only looked at the Mime. Mirror the chunks/content/remote check so a file promoted to a directory is not deleted once its children are gone. * s3: serve ranged GET for a directory that carries object data Reject only zero-size directories so a file promoted to a directory streams range requests instead of returning 404, while empty directories still 404. * s3: return HEAD metadata for a directory that carries object data HEAD now 404s a directory only when it has no data, so a promoted object is retrievable while empty/implicit directories still fall back to LIST.	2026-06-23 14:06:00 -07:00
7y-9	ddd11e44f9	feat: support marking volumes by collection (#9585 ) * feat: add collection.mark shell command Add collection.mark to mark all existing normal volume replicas in a collection as readonly or writable. The command runs in preview mode by default and requires -apply to execute changes. It reuses existing volume mark RPCs, supports default collection aliases, skips EC shards, and adds unit tests for option parsing and target collection logic. * Revert "feat: add collection.mark shell command" This reverts commit `50c2bbf94c`. * feat: support marking volumes by collection Add a -collection option to volume.mark so operators can mark every normal volume replica in a collection using existing topology data and volume mark RPCs. The change keeps the single-volume path unchanged and adds tests for collection target selection, EC shard exclusion, and argument validation. Co-authored-by: Codex <noreply@openai.com> * volume.mark: reuse eachDataNode for collection traversal * volume.mark: continue past per-volume failures and report progress Collection marking aborted on the first failed RPC, leaving the collection half-marked with no record of which volumes succeeded. Mark every reachable volume, print per-volume progress to the writer, and return an aggregated error naming the failures. * volume.mark: let -collection _default target the unnamed collection Other volume commands use the _default sentinel to match volumes with no named collection; volume.mark could not reach them at all. Map _default to the empty collection name in the filter. --------- Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-06-23 11:27:43 -07:00


14316 Commits