seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-21 09:11:29 +00:00

Author	SHA1	Message	Date
Chris Lu	e332b97d52	fix(shell): volume.balance no longer drains all volumes onto one server (#9579 ) * fix(shell): volume.balance no longer drains all volumes onto one server The density-based capacity function reads per-disk VolumeInfos sizes, but adjustAfterMove only updated VolumeCount and the selectedVolumes map. The planner re-read a stale topology after every move, so the source node's density never dropped and it kept moving volumes until that node was empty. Move the volume's size accounting between disks after each planned move so the density recomputes and the loop converges to an even distribution. * refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove removeVolumeInfo swaps with the last element instead of shifting, and the disk is fetched by key rather than ranging the DiskInfos map. 4.27	2026-05-20 01:39:23 -07:00
Chris Lu	868849392c	4.27	2026-05-20 00:25:16 -07:00
Chris Lu	a4415c39aa	fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads (#9574 ) * fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads The periodic flush snapshotted entry.Chunks, then ran CompactFileChunks and MaybeManifestize (the manifest upload is a network round trip) before reassigning entry.Chunks. Async uploaders append freshly uploaded chunks during that window, and the reassignment overwrote them: the data stayed on the volumes but the file lost those chunk references, leaving zero-filled holes on read. Large sequential writes such as cat of two 15 GiB files hit several flush cycles and ended up corrupted. Snapshot the chunk list under the entry lock with a length marker, do the slow compaction and manifestization on the snapshot, then splice the processed prefix back in front of whatever chunks arrived after the snapshot. * mount: drop redundant slice copies in the flush splice processedPrefix is freshly built and the tail sub-slice is consumed immediately under the entry lock, so append straight onto processedPrefix instead of allocating two throwaway copies.	2026-05-19 20:47:52 -07:00
Lars Lehtonen	9914e6af30	chore(weed/command): prune unused functions (#9573 ) * chore(weed/command): prune unused functions * drop now-unused closed field and renderLocked guard --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-19 17:45:50 -07:00
Chris Lu	cc5ef1b741	feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572 ) * feat(s3): add TagUser, UntagUser, ListUserTags IAM actions Adds AWS IAM-compatible user tag operations on the embedded IAM endpoint. Tags persist in the Identity proto as a repeated UserTag field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits are enforced. Pagination is stubbed (IsTruncated=false) since the 50-tag cap means all tags fit in a single response. * review: validate UntagUser TagKeys entries parseTagKeysParams now rejects empty keys and keys past MaxUserTagKeyLength; UntagUser additionally requires at least one TagKeys.member.N entry to match AWS validation behavior. * review: pre-allocate user-tag merge and filter slices mergeUserTags now allocates the combined existing+incoming capacity up front; UntagUser builds the filtered slice via make with the full ident.Tags capacity instead of ident.Tags[:0:0], which forced a reallocation on every append. * review: cover duplicate-in-request and invalid TagKeys cases Regression tests assert TagUser rejects two members with the same key in one request, and UntagUser rejects missing/empty/oversized TagKeys entries.	2026-05-19 17:35:44 -07:00
Chris Lu	37b6a14b0d	feat(s3): add four bucket configuration handlers (#9570 ) * feat(s3): add four bucket configuration handlers - GetBucketPolicyStatus: computes IsPublic from the existing bucket policy - PutBucketRequestPayment: companion writer to the existing GET; accepts only BucketOwner - GetBucketAccelerateConfiguration: returns <Status>Suspended</Status> - GetBucketLogging: returns an empty BucketLoggingStatus Lets AWS SDK probes succeed instead of returning MethodNotAllowed. * review: route GetBucketPolicyStatus through checkBucket Mirrors the existence/auth gating used by other bucket handlers and drops the bespoke filer_pb lookup so NoSuchBucket precedence is consistent across the API surface. * review: cap PutBucketRequestPayment body with MaxBytesReader The body is unmarshalled as RequestPaymentConfiguration, which is a handful of bytes; reject excessively large payloads up front and defer Close immediately after wrapping. * review: gate static getters on checkBucket GetBucketAccelerateConfiguration and GetBucketLogging now run the standard bucket existence check before returning the static Suspended / empty-status response so a missing bucket cannot appear to have valid configuration. * review: share cache helper across misc tests; check io.ReadAll error Accelerate and Logging tests now run through newMiscTestServer like the others so the checkBucket guard sees a cached bucket; the ReadAll error is explicitly checked.	2026-05-19 17:35:08 -07:00
Chris Lu	cee2bf697c	feat(s3): stub bucket configuration list endpoints (#9571 ) * feat(s3): stub bucket configuration list endpoints Adds Get and List handlers for Analytics, Inventory, IntelligentTiering, and Metrics bucket configurations. List returns an empty result with IsTruncated=false; single-get returns NoSuchConfiguration so SDK error parsing remains predictable. * review: gate stubs on bucket existence All eight stub handlers now call checkBucket via stubBucketGuard so NoSuchBucket takes precedence over NoSuchConfiguration / empty-list responses, matching AWS S3 precedence. Tests provide a cached bucket so the guard sees it as present.	2026-05-19 17:34:51 -07:00
Chris Lu	285025eb73	s3api: support group inline policies + Condition enforcement (#9569 ) * test(s3api): cover IAM inline policy aws:SourceIp + group inline gap Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload → VerifyActionPermission with a synthetic 127.0.0.1 request and assert that the policy's IpAddress condition flips the outcome. The user-policy cases pass on master (hydrateRuntimePolicies already routes inline docs through the policy engine, so Condition blocks are honored end- to-end). The group-policy case fails: PutGroupPolicy still returns NotImplemented, so a group inline doc never lands in the engine. Integration counterparts live under test/s3/iam/ and exercise the same paths against a live SeaweedFS S3+IAM endpoint. * s3api: support group inline policies + Condition enforcement PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to return NotImplemented in embedded IAM mode, so anything attached to a group as an inline doc — including aws:SourceIp or any other Condition — was simply unreachable. Wire the four endpoints to the credential-store methods that were already in place (memory, postgres, filer_etc all implement GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies now also walks LoadGroupInlinePolicies, registers each doc in the IAM policy engine under __inline_group_policy__/<group>/<policy>, and appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are added to the ReloadConfiguration trigger list in DoActions. Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too. Without it iam.groups never repopulated on a memory-store reload, so group policy evaluation silently no-op'd whether the policy was inline or attached. The existing tests didn't notice because no test reloaded through cm after creating a group. The NotImplemented unit test is inverted to drive the new round-trip. * s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy DoActions already triggers ReloadConfiguration for both actions via the explicit reload list, so calling refreshIAMConfiguration inline runs the load twice per request. Per PR review. * s3api: scope group-policy resource names per test; tighten deny polling - Integration test resource names get a per-test suffix so retried or parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists. - Deny-path Eventually loops gate on AccessDenied via a typed helper rather than any non-nil error; transient setup errors no longer end the wait prematurely. - ListGroupPolicies returns ServiceFailure when the credential manager is nil, matching Put/Get/DeleteGroupPolicy. * test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs CI runners with happy-eyeballs resolve `localhost` to ::1 first, in which case a 127.0.0.0/8-only allow would silently never match and the deny-driven enforcement test would hang for the allow case. Add ::1/128 to every loopback-matching policy so the allow path works regardless of which loopback family the SDK lands on.	2026-05-19 16:03:45 -07:00
Chris Lu	77ac781bbd	fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568 ) * fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers When a volume server holds EC shards for the same vid across more than one disk, each DiskLocation registers its own EcVolume entry and Store.FindEcVolume returns whichever one it hits first. The shard-info RPC iterated only that single EcVolume's Shards, so the response missed every shard mounted on a sibling disk. The worker's verifyEcShardsBeforeDelete sums the per-server responses into a union bitmap and refuses to delete the source volume when the union falls short of dataShards+parityShards. On multi-disk destinations, the union was systematically under-counted and source deletion got blocked even though all shards were physically present and mounted. Walk every DiskLocation in the handler and emit the deduplicated union of all shards. The .ecx-backed fields (file counts, volume size) still come from a single EcVolume since every disk's entry opens the same .ecx via NewEcVolume's cross-disk fallback. Tests: - TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in weed/server/. - test/volume_server/grpc/ec_verify_multi_disk_test.go integration test drives the full generate -> mount -> redistribute -> restart -> reconcile path and asserts both VolumeEcShardsInfo and VerifyShardsAcrossServers + RequireFullShardSet (the production source-deletion gate) report all 14 shards. - ec_multi_disk_lifecycle_test.go tightened: replaces the "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a full-shard-set assertion. * review: use ShardBits bitmask + cap-pre-allocation for shard dedup	2026-05-19 14:58:56 -07:00
Chris Lu	f72983c1fd	fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566 ) * fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" The S3 Tables REST endpoints share top-level paths with the regular S3 API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for GetTable). They are registered first on the same router as the bucket subrouter, so a path-style request such as GET /buckets?list-type=2 on a bucket actually named "buckets" matched ListTableBuckets and returned JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with "Unexpected character '{' (code 123) in prolog". Disambiguate by requiring the AWS V4 credential scope to name the s3tables service on the colliding routes. Regular S3 SDKs sign with service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is present in both the Authorization header and the X-Amz-Credential query parameter for presigned URLs, so the matcher works for both flavors. ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.) already cannot collide because colons are not valid in bucket names, so they are left untouched. * fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal The Iceberg catalog integration tests send unsigned PUT /buckets with Content-Type: application/x-amz-json-1.1 to create table buckets. With only the credential-scope check, those requests fell through to the regular S3 CreateBucket handler and the suite went red on this branch. Extend the matcher so a request is recognized as S3 Tables when either: - its AWS V4 credential scope names SERVICE=s3tables; or - it carries the canonical AWS JSON RPC 1.1 content type and is unsigned (a request explicitly signed for SERVICE=s3 still wins). The regular S3 SDKs do not send application/x-amz-json-1.1, so the signal is safe for the colliding paths (/buckets, /get-table). Also add an AWS SDK V2 for Go integration test under test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer against a bucket literally named "buckets" and "get-table" — the SDK errors before the test asserts if the server returns the wrong body shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml, mirroring the etag/acl workflow. * s3api: extend service matcher to all S3 Tables routes; simplify scope check - Apply serviceMatcher to every S3 Tables route, not just the bare-path ones. ARN-bearing paths could otherwise be hit by an S3 object key that starts with arn:aws:s3tables:..., inside a bucket named "buckets", "namespaces", "tables", or "tag". One matcher everywhere closes both collision classes. - Replace strings.Split + index lookup with strings.Contains for the credential-scope check. The scope shape is fixed at AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components, and access keys are alphanumeric — so /s3tables/ matches iff SERVICE is exactly s3tables. Existing unit cases (including the access-key-substring case) still pass. - Read the GetObject body in the SDK v2 routing test with io.ReadAll; the single Read could return short and make the equality check flaky. * s3api: drop content-type fallback; sign s3 tables harness traffic instead The content-type fallback in isS3TablesSignedRequest let an anonymous regular-S3 request whose body type is application/x-amz-json-1.1 hit an S3 Tables route when the path-style object key happened to be shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets" with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher back to the AWS V4 credential scope so only requests signed for SERVICE=s3tables match the S3 Tables routes. Update the Iceberg catalog test harness — the only caller still sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The mini instance runs in default-allow mode, so the signature itself is not verified; only the credential scope matters for the route match. Drop the stale unit cases for the JSON-RPC content-type signal and the routing test that exercised unsigned harness traffic.	2026-05-19 14:24:25 -07:00
Chris Lu	cfc08fbf6c	fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563 ) (#9565 ) * fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0 so the on-disk needle header carries Size=0; TombstoneFileSize is only the .idx sentinel for "this entry is deleted" and is never written into a needle header. ReadBytes' size check therefore mismatched on every tombstone (-1 != 0), returned ErrorSizeMismatch, and triggered the 4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any volume large enough that offset+32 GB exceeds dat fileSize the retry read EOF, CheckVolumeDataIntegrity reported corruption, and the loader set noWriteOrDelete = true. Every volume whose last 10 .idx entries included a deletion went read-only on startup — i.e. any healthy volume where the most recent operations included a delete. Pass Size(0) so the size check matches the on-disk tombstone header. Add a regression test that writes three needles, deletes one, and asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx tail. Without this fix the test reproduces the exact log shape from the bug report: read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF The Rust port guards its integrity-check size comparison with !size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never hits this path, so no Rust mirror change is needed. * test(seaweed-volume): mirror Go regression for deletion-tombstone integrity The Rust integrity check already guards its size-mismatch comparison with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs with body_size=0, so the Go regression fixed in the previous commit does not apply. Lock that guarantee in with a parallel reload test: write three needles, delete one, sync, reopen via Volume::new, assert the volume is not flipped read-only. Catches any future change that removes the deleted-entry guard or re-introduces a size-strict path in check_volume_data_integrity for tombstones. * fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity CheckVolumeDataIntegrity relies on identity comparison against io.EOF and ErrorSizeMismatch to walk back through the last ten .idx entries and tolerate a partial truncation at the tail (the "fix and continue" loop). The live-needle branch in doCheckAndFixVolumeData already returns those sentinels unwrapped; the deletion branch wrapped them in fmt.Errorf, so a genuine .dat truncation past a tombstone offset broke the recovery and flipped the volume read-only. Mirror the live-needle handling: both verifyDeletedNeedleIntegrity and doCheckAndFixVolumeData now short-circuit on io.EOF / ErrorSizeMismatch and pass them through unwrapped. Other errors keep their existing context wrapping. Also tighten the regression test to capture lastAppendAtNs and assert it's non-zero, so a future regression that skips the tombstone body (and therefore never populates AppendAtNs) is caught even when the err check still passes.	2026-05-19 13:11:19 -07:00
Chris Lu	d57de6dc20	fix(s3): keep anonymous access working with EnableIam default (fixes #9557 ) (#9567 ) fix(s3): keep anonymous access working with EnableIam default `docker run seaweedfs` (and `weed mini` with no config) start with EnableIam=true but no IAM config file and no identities. The advanced-IAM init path was failing in 4.25 because of the missing STS signing key, which masked a latent bug: SetIAMIntegration unconditionally flipped isAuthEnabled to true, and isEnabled() also treated a non-nil iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the STS fallback started succeeding, the integration got installed end to end, and every anonymous S3 request bounced as AccessDenied. Separate the two concerns: SetIAMIntegration just plumbs in the OIDC / embedded-IAM machinery, and a new EnableAuthEnforcement opts in to enforcement. The startup path calls it only when -s3.iam.config is actually provided, so operators with explicit IAM configs still get auth (preserves #7726). isEnabled() now reads isAuthEnabled only.	2026-05-19 13:03:30 -07:00
Peter Dodd	4476cb282b	feat(filer): add atime to FuseAttributes + TouchAccessTime RPC (#9556 ) * feat(filer): add atime field and TouchAccessTime RPC to filer proto Introduce POSIX-style access-time tracking on the filer: - FuseAttributes gains atime (field 22) and atime_ns (field 23). - New TouchAccessTime RPC (and Touch{Access,Time}{Request,Response}) lets read paths bump atime without going through UpdateEntry's chunk-rewrite/EqualEntry short-circuit. Additive proto changes only; zero atime is treated as unset and existing clients are unaffected. Java client proto is kept in lock step. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(filer): wire Atime through Attr codec with mtime fallback Add Attr.Atime and round-trip it through EntryAttributeToPb / EntryAttributeToExistingPb / PbToEntryAttribute. A zero proto atime decodes as Mtime, so legacy entries report a sensible value and freshly-created/updated entries default Atime to Mtime when callers do not set it explicitly. CreateEntry and UpdateEntry stamp Atime = Mtime (or Crtime) when it is zero. TouchAccessTime later bypasses this path to write atime alone via Store.UpdateEntry. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(filer): preserve atime in first epoch second on decode The Atime decode branch previously treated any attr.Atime == 0 as unset and overwrote it with Mtime, which drops valid timestamps in the first second of the unix epoch where attr.Atime is 0 but attr.AtimeNs > 0. Check both fields so we only fall back to Mtime when both are zero. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 10:22:17 -07:00
Chris Lu	b63610cf8f	volume: accept legacy needle CRC encoding on read (#9564 ) Volumes written by versions before 3.09 (commit `056c480eb`) store the needle checksum using the deprecated CRC.Value() transform. When the read path moved into readNeedleTail, the fallback that accepts both encodings was dropped, so .dat files copied from old installs now fail verification with "invalid CRC ... data on disk corrupted" even though the data is intact. Restore the dual check, matching the surviving fallback in volume_read.go.	2026-05-19 09:58:47 -07:00
Chris Lu	c61d227613	s3api: verify source permission on CopyObject and UploadPartCopy (#9555 ) * s3api: verify source permission on CopyObject and UploadPartCopy The Auth middleware only authorized the destination because routes key on the request URL. The source from X-Amz-Copy-Source was never evaluated, so an STS session token scoped to one prefix could copy from any other prefix in the same bucket. Add AuthorizeCopySource on IdentityAccessManagement to run the full bucket-policy + IAM/identity flow against the source, using a synthetic GetObject request so action resolution lands on s3:GetObject (or s3:GetObjectVersion when a source versionId is supplied). Both CopyObjectHandler and CopyObjectPartHandler now invoke it before reading the source. * s3api: preserve presigned-URL session token on copy-source check Presigned CopyObject / UploadPartCopy requests carry the STS session token in the query string (X-Amz-Security-Token), not in a header. Rebuilding the synthetic source URL from scratch dropped that token, so the source authorization would fall through to non-STS paths and miss session policy enforcement. Forward X-Amz-Security-Token from the original query (alongside versionId), still excluding unrelated params like uploadId/partNumber that would steer ResolveS3Action away from s3:GetObject.	2026-05-18 21:35:53 -07:00
Chris Lu	7c252e1f16	fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515 ) (#9526 ) * fix(volume): reopen .idx writable after MarkVolumeWritable When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable only flipped noWriteOrDelete back to false and rewrote .vif, so writes still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its writable form (in-memory or leveldb small/medium/large) before flipping the flag. Mirror the same fix in seaweed-volume: the Rust load path leaves CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume boots read-only, so post-MarkVolumeWritable puts silently succeeded in-memory only and were lost on the next restart. set_writable now reattaches an append-mode writer when one is missing. * fix(volume): keep old needle map until replacement is built; defer writable flag Go: build the writable needle map into a local before swapping. A construction failure now leaves v.nm pointing at the original SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of stranding the volume with v.nm == nil. Rust: attach the .idx writer before flipping no_write_or_delete to false. A transient open/metadata failure used to leave the volume marked writable with no writer attached, and subsequent puts would silently skip the on-disk append.	2026-05-18 20:51:04 -07:00
Chris Lu	7c5296dfb1	fix(admin): switch file browser upload/download to filer gRPC + volume HTTP (#9538 ) * fix(admin): switch file browser upload/download to filer gRPC + volume HTTP The admin file browser proxied uploads and downloads through the filer's HTTP listener, so the whole feature 404'd against filers started with -disableHttp=true even though S3 still worked on its own port. Re-route through the filer gRPC service: LookupDirectoryEntry + StreamContent for reads (chunks flow straight from the volume servers), AssignVolume + volume HTTP POST + CreateEntry for writes. Volume read tokens come from jwt.signing.read.key when configured; the old jwt.filer_signing tokens no longer apply since the filer HTTP surface is bypassed. * admin file browser: propagate request context + track response writes Pass r.Context() into uploadFileToFiler so a client disconnect cancels the in-flight chunked upload instead of letting it run to completion against the volume servers. For DownloadFile, replace the Content-Type probe with a small response-writer wrapper that records whether headers or bytes have actually been sent, so the error path can't silently convert a pre-stream failure into a partial response if future code moves the header-setting around.	2026-05-18 20:33:16 -07:00
Chris Lu	58c3fa802c	fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540 ) When s3.domainName is set, all bucket-prefix routes were gated on a matching Host header. Requests that arrive via an IP, an unlisted hostname, or a reverse proxy that rewrites Host hit no router and bounce back as 405/404 (and 503 once a proxy maps the upstream error). Register the path-style catch-all unconditionally, after the host-specific routers, so it only fires when no Host matcher applies.	2026-05-18 19:44:19 -07:00
dependabot[bot]	d3f80444df	build(deps): bump github.com/cognusion/imaging from 1.0.2 to 1.0.3 (#9552 ) Bumps [github.com/cognusion/imaging](https://github.com/cognusion/imaging) from 1.0.2 to 1.0.3. - [Commits](https://github.com/cognusion/imaging/compare/v1.0.2...v1.0.3) --- updated-dependencies: - dependency-name: github.com/cognusion/imaging dependency-version: 1.0.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:43:33 -07:00
Chris Lu	0dc65e7069	fix(admin.plugin): include disk_id in EC execution plan (#9547 ) TaskSource and TaskTarget carry disk_id on the wire, but the execution plan map built for the admin UI dropped the field entirely. On a multi-disk node holding shards of the same volume, there was no way to tell from the plan which disk would receive each shard. Include disk_id on each endpoint and target_disk_id on each shard assignment, and extend the existing execution-plan test to set and assert the field.	2026-05-18 19:43:18 -07:00
ᎠᎡ. Ѕϵrgϵ Ѵictor	18c6c24e47	Revise MinIO comparison in README for accuracy (#9548 ) Updated the README to reflect the current status of MinIO, noting its ceased development and security concerns, along with changes in the descriptions of its features compared to SeaweedFS.	2026-05-18 19:32:54 -07:00
dependabot[bot]	120901c883	build(deps): bump github.com/parquet-go/parquet-go from 0.28.0 to 0.30.1 (#9549 ) Bumps [github.com/parquet-go/parquet-go](https://github.com/parquet-go/parquet-go) from 0.28.0 to 0.30.1. - [Release notes](https://github.com/parquet-go/parquet-go/releases) - [Changelog](https://github.com/parquet-go/parquet-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/parquet-go/parquet-go/compare/v0.28.0...v0.30.1) --- updated-dependencies: - dependency-name: github.com/parquet-go/parquet-go dependency-version: 0.30.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:42 -07:00
dependabot[bot]	a79880ed41	build(deps): bump github.com/redis/go-redis/v9 from 9.18.0 to 9.19.0 (#9550 ) Bumps [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) from 9.18.0 to 9.19.0. - [Release notes](https://github.com/redis/go-redis/releases) - [Changelog](https://github.com/redis/go-redis/blob/master/RELEASE-NOTES.md) - [Commits](https://github.com/redis/go-redis/compare/v9.18.0...v9.19.0) --- updated-dependencies: - dependency-name: github.com/redis/go-redis/v9 dependency-version: 9.19.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:32 -07:00
dependabot[bot]	f5aa776742	build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob from 1.6.4 to 1.7.0 (#9551 ) build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob Bumps [github.com/Azure/azure-sdk-for-go/sdk/storage/azblob](https://github.com/Azure/azure-sdk-for-go) from 1.6.4 to 1.7.0. - [Release notes](https://github.com/Azure/azure-sdk-for-go/releases) - [Commits](https://github.com/Azure/azure-sdk-for-go/compare/sdk/storage/azblob/v1.6.4...sdk/azcore/v1.7.0) --- updated-dependencies: - dependency-name: github.com/Azure/azure-sdk-for-go/sdk/storage/azblob dependency-version: 1.7.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:24 -07:00
dependabot[bot]	f3d6633aac	build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.99.0 to 1.101.0 (#9553 ) build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.99.0 to 1.101.0. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.99.0...service/s3/v1.101.0) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/service/s3 dependency-version: 1.101.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:06 -07:00
Chris Lu	68794fb94c	fix(ec_distribute): remove partial files on copy stream error (#9543 ) * fix(ec_distribute): remove partial files on copy stream error writeToFile opens the destination with O_TRUNC and streams into it. On a mid-stream receive / write / cancellation error it returned the failure but left the destination behind in whatever state had been written so far — typically 0 bytes when the source errored before sending any FileContent. VolumeEcShardsCopy distributes .ecx by calling doCopyFile, so this same stub-leaving behaviour produced the 0-byte .ecx files seen on EC encoding failures: the source claims a non-zero ModifiedTsNs (so the existing "source not found" cleanup doesn't fire), the stream then errors immediately, and the receiver ends up with a 0-byte .ecx that downstream code mistook for a valid empty index. Clean up the partial file on every error path that returns from the streaming loop (receive, write, and cancellation). Skip cleanup when isAppend=true so resumable appends keep their existing content. As defense in depth, VolumeEcShardsCopy also stats the .ecx after copy and removes / errors on a 0-byte result so the orchestrator can pick a different source. The Rust volume server has only the source side of CopyFile (no client-side stream-to-disk consumer) and no .ecx subsystem yet, so this fix has no Rust mirror. * fix(ec_distribute): close file before remove, fail fast on stat error Address review feedback: - writeToFile's mid-stream removeIncomplete called os.Remove while the destination file handle was still open. On Windows os.Remove fails while a handle is open, so the cleanup wouldn't run there. Wrap the handle close in a once-only helper, call it from removeIncomplete and from the existing "source not found" cleanup, and keep a deferred close as the safety net for the normal-return path. - VolumeEcShardsCopy's post-copy .ecx check silently passed when os.Stat returned an error: doCopyFile had reported success but if the file was already gone, unreadable, or somehow a directory, the orchestrator only learned at mount time with no useful context. Treat any non-nil stat error and any directory result as a copy failure here and surface it immediately.	2026-05-18 15:19:51 -07:00
Chris Lu	af8d4e00ee	fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures (#9542 ) * fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures MountEcShards's per-disk loop bailed on the first disk returning a non-ENOENT error, and NewEcVolume wrapped its ENOENT with %v so the caller's `err == os.ErrNotExist` check never matched. On a multi-disk volume server where ec.balance / ec.rebuild had distributed shards across sibling disks while the matching .ecx never arrived, the mount loop bailed after disk 0 with "cannot open ec volume index" and the operator never saw that the rest of the disks were also empty. The companion failure mode is a 0-byte .ecx stub left by EC distribute's writeToFile after a mid-stream copy failure: Stat() succeeds, treating the stub as a valid index, and downstream mount work proceeds against an empty file. Wrap the ec-volume open errors with %w, treat a 0-byte .ecx as os.ErrNotExist (in NewEcVolume, findEcxIdxDirForVolume, and HasEcxFileOnDisk), and have MountEcShards collect per-disk failures before returning a single aggregated error. The "no .ecx anywhere" case gets a distinct error so the orchestrator can re-copy the index from a healthy replica rather than retry against the same broken state. * fix(ec_reconcile): indexEcxOwners also rejects 0-byte .ecx stubs findEcxIdxDirForVolume already skipped 0-byte .ecx during MountEcShards, but indexEcxOwners (used by reconcileEcShardsAcrossDisks at startup) still recorded the first .ecx by name only. On a store where one disk holds a 0-byte stub left by a failed EC distribute and a sibling disk holds the real index, the stub would win the owner selection — and NewEcVolume's new size check would then refuse to load against it, leaving the orphan shards unloaded even though a valid index exists. Mirror the size check from findEcxIdxDirForVolume: skip directory entries whose .ecx Info() reports size 0 or whose Info() call fails. * fix(ec_mount): accept 0-byte .ecx as valid empty index The previous commit treated a 0-byte .ecx in NewEcVolume as os.ErrNotExist, on the assumption that any empty .ecx was a stub left by a failed copy stream. That broke the legitimate empty-volume case: when an EC volume's source .idx has no live entries (e.g. all needles deleted before WriteSortedFileFromIdx), the sorted .ecx is genuinely 0 bytes and must mount. The integration test TestEcShardsToVolumeMissingShardAndNoLiveEntries fails with "MountEcShards: no .ecx index found on any local disk" because the mount path now refuses the legitimate empty index. A 0-byte .ecx left by a failed copy stream is indistinguishable from the legitimate empty case by file size alone. Preventing stub files from being written is the receiver-side cleanup in writeToFile's job (the companion EC distribute PR), not NewEcVolume's at mount time. The cross-disk lookup helpers (findEcxIdxDirForVolume, HasEcxFileOnDisk, indexEcxOwners) keep their size > 0 preference: when a real .ecx exists on a sibling disk alongside a stub, we still want to route mounts and reconcile at the real one. If no non-zero .ecx exists anywhere, the per-disk fallback in MountEcShards can still open the 0-byte .ecx and the volume mounts. Replace TestMountEcShards_ZeroByteEcxOnlyDisk with TestMountEcShards_EmptyEcxMountsSuccessfully, which pins the empty-volume invariant.	2026-05-18 15:00:33 -07:00
Chris Lu	41b6ad002b	fix(volume.list): show one entry per physical disk on multi-disk nodes (#9541 ) * fix(volume.list): show one entry per physical disk on multi-disk nodes DataNodeInfo.DiskInfos is keyed by disk type, so several same-type physical disks on one node collapse to a single map entry at the master. volume.list iterated that map directly and reported one "Disk hdd ... id:0" line per node, hiding the per-disk volume and shard layout. EC operators on multi-disk volume servers had no way to verify which physical disk a shard landed on. Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk() method on the proto type so consumers outside admin/topology can use it. Apply it in writeDataNodeInfo so the verbose Disk block shows one entry per physical disk, ordered by DiskId. Capacity counters are split evenly across reconstructed disks since the wire format doesn't carry per-disk capacity yet. This is a display-only change. ActiveTopology already did the split on its own and is now updated to call the shared helper. * fix(volume.list): preserve totals, count active/remote exactly, dedupe header Address review feedback on the per-physical-disk split: - share() truncated remainders so reconstructed per-disk counters could sum to less than the original aggregate (10 / 3 = 3+3+3). Distribute the remainder to the lowest disk ids so MaxVolumeCount and FreeVolumeCount sum exactly back to the node totals. - ActiveVolumeCount and RemoteVolumeCount are derivable per disk from the VolumeInfos already grouped by DiskId, so count them exactly (ReadOnly=false and RemoteStorageName!="" respectively) instead of approximating with an even split. - writeDataNodeInfo's per-disk callback fired the DataNode header on every iteration after the split, so a node with 6 physical disks emitted 6 DataNode headers. Guard the callback with headerPrinted so the header still appears at most once per node. - Sort split disks deterministically using explicit DiskId comparison to avoid int overflow risk on 32-bit systems. - Tighten the volume.list test substring to "id:N\n" so unrelated tokens like "ec volume id:101" don't accidentally match the id:1 needle, and assert the rack callback fires once.	2026-05-18 14:43:44 -07:00
Chris Lu	a761441926	fix(test): reserve mini ports on all interfaces; bound risingwave cleanup shell (#9545 ) The 127.0.0.1-only reservation in AllocateMiniPorts/AllocatePortSet let another process hold the gRPC port on a different interface, so weed mini's isPortAvailable check failed and it shifted master.grpc. weed shell -master=<HTTP> still derives grpc as HTTP+10000 and dialed the unused port, hanging until the 30s context deadline killed it. Bind the reservation listeners on :port to match mini's check. Also bound listFilerContents in catalog_risingwave with a 30s exec.CommandContext so a hung weed shell during failure-cleanup can't burn the 20-minute test budget.	2026-05-18 14:16:22 -07:00
Chris Lu	37e6263efe	fix(shell): attach admin JWT for filer IAM gRPC calls (#9536 ) When jwt.filer_signing.key is set, the filer's IamGrpcServer requires a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed without that header and failed with Unauthenticated. Route them through a small helper that mints a token from the same key viper-loaded from security.toml and appends it as outgoing metadata, matching the credential grpc_store pattern.	2026-05-18 13:42:32 -07:00
Chris Lu	3d872a1416	fix(filer): load -s3.config static identities into the filer's CredentialManager (#9537 ) When weed filer started its embedded S3 gateway with -s3 -s3.config, only the S3 server loaded the s3.json static identities — the filer's own CredentialManager stayed empty, so the IAM gRPC service backing the admin UI and weed shell returned only dynamic users. Mirror the wiring weed server already does and hand the same config path to the filer.	2026-05-18 13:41:30 -07:00
Chris Lu	4d04609bb8	fix(mount): don't release file handles from FUSE Forget (#9529 ) fix(mount): don't release file handles from Forget Forget(nodeid, nlookup) only decrements the kernel inode lookup count. File handle lifecycle belongs to FUSE Open/Release. Driving the FH refcount from Forget coupled two unrelated counters and could tear down a still-live handle if Forget ever raced ahead of Release. Drop the ReleaseByInode call (and the now-unused method).	2026-05-18 01:02:58 -07:00
Chris Lu	01b3e4a71c	template 4.26	2026-05-17 23:12:04 -07:00
Chris Lu	6cab199400	fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527 ) * fix(iceberg): dial filer gRPC address verbatim in plugin worker dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress, whose single-port fallback adds +10000 to any host:port — so when the admin forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker, the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent 28888 and dispatched jobs fail with connection refused. Drop the conversion; the address is already dialable. Tests that produced fake filer addresses in dual-port form now return host:grpcPort to match the new contract. * test(ec): use renamed detection_interval_minutes field The admin_runtime.detection_interval_seconds field was renamed to detection_interval_minutes back in May. This integration test was not updated, so the unknown JSON field was silently ignored and the scheduler fell back to the default detection interval (17 min for erasure_coding), which exceeds the test's 5-minute wait and times out. Switch to detection_interval_minutes: 1 — local run completes in ~120s.	2026-05-17 23:03:00 -07:00
Chris Lu	136eb1b7c8	4.26	2026-05-17 21:05:25 -07:00
Chris Lu	c11ff6657b	fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525 ) * fix(ec): mirror EC sidecars onto every shard-bearing disk at startup In a multi-disk volume server, ec.balance and ec.rebuild can land shards on a disk that does not also hold the matching .ecx / .ecj / .vif index files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks already loads those shards by pointing the EcVolume at the sibling disk's index files; reads work, but any failure on the index-owning disk silently disables every shard on the other disk, even though those shards are physically fine. This change adds mirrorEcMetadataToShardDisks, a startup pass that physically replicates .ecx / .ecj / .vif onto each disk that holds shards but is missing them. Each copy is atomic (tmp + fsync + rename) and idempotent (a destination that already has the sidecar is preserved). After mirroring, the cross-disk reconciler prefers the local IdxDirectory so the EcVolume mounts self-contained; the cross-disk virtual mount remains as a fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). The same-disk invariant the EC lifecycle (encode / decode / balance / vacuum / repair) was already documented as promising is now actually restored at boot, so a future failure of one disk in a split-shards layout no longer takes the other disk's shards with it. Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj / .vif from dir1) and idempotency (an existing destination .ecx is not overwritten with the owner's copy). * fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check hasAllEcSidecarsLocally checked only the modern destination path (IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk that still had a legacy .ecx in its data dir (written before -dir.idx was set) would report "not present" and the mirror would write a second copy to IdxDirectory, leaving two .ecx files on disk. Matches HasEcxFileOnDisk's open-with-fallback contract: check the modern path first, then the opposite directory. Factored the exists-and-not-a-dir check into a small statRegular helper so the fallback ladder stays readable. * rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup Port of the Go fix (commit `088e26ea6`) to the Rust volume server. Adds Store::mirror_ec_metadata_to_shard_disks, called from add_location / load_new_volumes before the cross-disk orphan reconciler. Physically copies .ecx / .ecj / .vif from the disk that owns the index files onto every disk holding shards but missing sidecars, so each shard-bearing disk ends up self-contained. The reconciler now prefers the local idx_directory when the mirror has installed a .ecx there; the cross-disk virtual mount remains as the fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). Adds ec_local_ecx_path helper shared between reconcile and mirror to detect the post-mirror fast path. Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when -dir.idx is configured and the destination still has a legacy .ecx in its data dir, that's recognized so the mirror does not write a duplicate copy into idx_directory. Tests cover the two key cases: orphan layout (dir0 receives the sidecars from dir1) and idempotency (a pre-existing destination .ecx is not overwritten). * trim verbose comments on EC mirror code Comments now lead with the WHY (non-obvious constraints, the post-mirror fast path, why local copies are authoritative) and drop restate-the-code blocks, headers, and section dividers. Behavior is unchanged; all existing tests still pass on both the Go volume server and the seaweed-volume Rust port. * drop github issue refs from added comments Two stray "#9212" references slipped into comments I added on the cross-disk reconciler call site. The git log carries the issue history; comments stand on their own. * test(ec): accept rebuild on either disk after sidecar mirror TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must land at the disk-0 path. With the boot-time sidecar mirror, every shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now picks whichever disk hosts the most shards — disk 1 in this layout after the deletion. The shard can legitimately rebuild on either disk; the test now accepts both and uses the chosen path for the subsequent mount + read verification.	2026-05-17 19:55:15 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Chris Lu	ff6f9fd90a	iam: honor configured credential store for IAM API policies and propagate to S3 caches (fixes #9518 ) (#9522 ) * iamapi: route managed policies through credential manager (fixes #9518) CreatePolicy via the IAM API wrote straight to the filer /etc/iam/policies.json, ignoring any non-filer credential store. When credential.postgres was configured, policies created via the IAM API landed only in the filer while the Admin UI wrote to postgres, producing a split-brain where ListPolicies/GetPolicy never saw the Admin UI's policies and vice versa. GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies from credentialManager and persist Create/Update/Delete as a delta against the store. Inline user/group policies still live in the legacy policies.json file (no credential-store API for them yet). Pre-existing managed policies in the legacy file are merged on read so deployments don't lose data, and re-persisted to the store on the next write so the legacy file is drained over time. * credential: route IAM API inline policies through credential manager Extends the #9518 fix to user-inline and group-inline policies so the IAM API never writes the legacy /etc/iam/policies.json bundle directly. The previous patch only routed managed policies; this one finishes the job for the other two policy types. - Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional interfaces, mirroring the existing user-inline ones, and matching Put/Get/Delete/List/LoadAll wrappers on CredentialManager. - Implement group-inline storage in memory (new map), filer_etc (new field on PoliciesCollection, reusing the legacy file under policyMu), and postgres (new group_inline_policies table with ON DELETE CASCADE off the groups FK). - Wire the new methods through PropagatingCredentialStore so wrapped stores still delegate correctly. - IamS3ApiConfigure.PutPolicies now applies managed + user-inline + group-inline as deltas through the credential manager; the legacy /etc/iam/policies.json file is never written when a credential manager is wired up. GetPolicies still reads the legacy bundle once as a fallback so unmigrated data is picked up and re-persisted into the store on the next write. * credential: propagate SaveConfiguration writes to running S3 caches Postgres (and any non-filer) credential stores never fired the S3 IAM cache invalidation path on bulk identity / group updates. The PropagatingCredentialStore had explicit Put/Remove handlers for single-entity calls (CreateUser, PutPolicy, etc.) but inherited SaveConfiguration unchanged from the embedded store, so the bulk path the IAM API takes at the end of every handler was silent. Inline-policy changes recompute identity.Actions and persist via SaveConfiguration, so until restart the cached Actions on each S3 server stayed stale and authorization decisions used the pre-change view. Override SaveConfiguration to snapshot the prior user / group lists, delegate the save, then fan out PutIdentity / PutGroup for what's in the new config and RemoveIdentity / RemoveGroup for what got pruned. Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes. * iamapi: drain legacy policies.json after authoritative credential-store writes Review pointed out a resurrection bug: GetPolicies still reads /etc/iam/policies.json as a one-way migration fallback, but PutPolicies in the credential-manager path never wrote that file, so legacy-only entries reappeared on the next read even after the IAM API "deleted" them. PutPolicies now overwrites the bundle with an empty {} after a successful credential-store write, unless the store is filer_etc (which owns the bundle as its own inline-policy backing — clearing it would wipe filer_etc's data). Also wraps the filer read, JSON unmarshal, and marshal errors with context per the other review comments.	2026-05-17 13:15:27 -07:00
Chris Lu	b4289abb0a	admin: convert filer address to gRPC form before dispatch (#9523 ) The master returns each registered filer in pb.ServerAddress dual-port form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's plugin context builder forwarded that string verbatim as filer_grpc_address, so workers calling grpc.DialContext on it failed every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port". Run each entry through pb.ServerAddress.ToGrpcAddress before populating ClusterContext.FilerGrpcAddresses. The lifecycle integration test now pins filer.port.grpc to a value that breaks the FILER_PORT+10000 assumption, and a new dispatch test drives the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and asserts the dispatched job both reaches the filer and deletes the backdated object.	2026-05-17 11:33:54 -07:00
Chris Lu	2a41e76101	fix(ec): blanket-clean every destination over the full shard range (#9512 ) * fix(ec): blanket-clean every destination over the full shard range The previous cleanup pass walked t.sources only, with the shard ids the topology had reported at detection time. In the wild, a destination can end up with EC shards mounted that the topology snapshot didn't list — shards on a sibling disk that hadn't heartbeated, or shards left over from a concurrent attempt's mount step. FindEcVolume still returns true, so the next ReceiveFile trips the mounted-volume guard. Cleanup now unions t.sources (with ShardIds) and t.targets and issues unmount + delete over [0..totalShards-1] on each. Both RPCs are idempotent on missing shards, so the wider sweep is free. Two new tests cover the gap: shards mounted beyond what t.sources lists, and a target-only destination with no source row. * log(ec): include disk_id in EC unmount/delete/refusal log lines The current logs identify the volume and shard but leave disk_id off, which makes the cross-server cleanup story hard to follow when multiple disks of one server hold pieces of the same volume: UnmountEcShards 4121.1 -> add disk_id ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id volume server X:Y deletes ec shards from 4121 [...] -> add disk_id ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids ReceiveFile's refusal now names the disk_ids actually holding the mount so operators can see whether the next cleanup pass needs to target a sibling disk. Added Store.FindEcVolumeDiskIds / Store::find_ec_volume_disk_ids as the supporting primitive. Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard, heartbeat delete log in diff_ec_shard_delta_messages, refusal in the ReceiveFile handler). * test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server The plugin-worker EC tests boot a fake volume server that embeds UnimplementedVolumeServerServer. After the worker started calling VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the default Unimplemented response surfaced as fourteen "method not implemented" errors and TestErasureCodingExecutionEncodesShards failed. Both RPCs are no-ops here — nothing on the fake server has mounted state or persisted shard files to remove.	2026-05-17 11:31:37 -07:00
Chris Lu	bf9110ebd3	fix(ec): mount falls back to sibling-disk .ecx (fixes #9519 ) (#9521 ) * fix(ec): mount falls back to sibling-disk .ecx (fixes #9519) MountEcShards iterated DiskLocations and on each disk called LoadEcShard with that disk's IdxDirectory as the .ecx home. When ec.balance lands the .ec?? shard on disk A but the .ecx on sibling disk B of the same volume server, NewEcVolume ENOENTs the .ecx and returns "cannot open ec volume index ...". That error is not os.ErrNotExist, so the per-disk continue branch did not engage and the mount loop bailed before trying any other disk. The startup reconciliation in reconcileEcShardsAcrossDisks already handles this layout for orphan shards discovered on boot (issue #9212). This change mirrors the same primitive on the mount path: look up the .ecx owner across all DiskLocations once and route NewEcVolume at that directory whenever the disk being mounted does not own its own copy of the .ecx. Same-disk mounts are unaffected because HasEcxFileOnDisk keeps LocalIdxDirectory in play. Adds a regression test that plants the index files on a sibling disk AFTER NewStore returns (so the startup reconcile is a no-op for that vid) and verifies MountEcShards succeeds; also pins the same-disk baseline against accidental re-routing. * fix(ec): skip redundant stats in cross-disk .ecx lookup (review) Two follow-ups from gemini-code-assist on #9521: 1. MountEcShards: when findEcxIdxDirForVolume already returned a path that lives on this disk's IdxDirectory or Directory, the disk owns the .ecx — skip the HasEcxFileOnDisk stat and use the local idx dir directly. Only re-check when the disk's directories are neither, so the duplicate-.ecx-on-multiple-disks edge case is still honored. 2. findEcxIdxDirForVolume: hoist the seen map across the location loop so a shared IdxDirectory (one -dir.idx paired with several -dir entries) is only stat'd once per call. Both are I/O optimizations; behavior is unchanged. Existing cross-disk and same-disk regression tests still pass. * docs(ec): drop issue/PR references from cross-disk mount comments Comments and test docstrings stand on their own; the issue number adds nothing a reader can act on and goes stale across forks. Keep the description of what the layout is and why the fallback exists, just without the reference.	2026-05-17 11:10:37 -07:00
Chris Lu	d51454adf4	rust(seaweed-volume): distributed EC read across peer servers (#9516 ) * feat(seaweed-volume): distributed EC read across peer servers EcVolume::read_ec_shard_needle previously errored with NotFound when any interval's shard wasn't local. In an RS(10,4)-across-N deployment each server holds one shard, so every read needed >=9 peer fetches and post-EC GETs returned 404 on volumes whose shards lived on more than one server. Mirror of weed/storage/store_ec.go's readOneEcShardInterval -> readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain: * server/store_ec.rs (new): entry point read_ec_shard_needle_distributed. Snapshots locate-needle + local reads under the Store sync lock, drops the lock, then async-fetches missing intervals via the peer's VolumeEcShardRead RPC. Falls back to Reed-Solomon reconstruction (read every other shard at the same (shard_offset, size) and run rs.reconstruct) when the direct peer read fails. Refreshes the per-EcVolume shard_locations cache from the master's LookupEcVolume RPC using Go's freshness thresholds (11s / 7min / 37min). * erasure_coding/ec_volume.rs: shard_locations now sits behind a std::sync::RwLock so the read path can refresh the map without holding the Store write lock. Adds shard_locations_refresh_time (Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's ShardLocationsLock / ShardLocationsRefreshTime fields. set/get helpers updated for interior mutability. * server/handlers.rs: GET handler now tries the local-only fast path first, then falls through to the distributed path on NotFound. * review: address PR 9516 feedback on distributed EC read Five of the six PR-review comments addressed; the sixth (JWT on outgoing peer gRPC) is deferred with an explicit TODO because the crate-wide outgoing-JWT signing surface doesn't exist yet — adding it in this one call site would split the credential plumbing across peer paths that already lack it (copy_file_from_source, batch_delete, …). Revisit when an outgoing-JWT helper lands. Fixed in this commit: * Handlers: drop the two-tier (local-first, then distributed) read in handlers.rs. read_ec_shard_needle_distributed already does the local-first pass under the same store read lock; the redundant outer attempt re-read local intervals twice for any needle that spanned mixed-locality shards. * Scanner snapshot: replace inline locate-needle math with `ecv.locate_needle(needle_id)`. Same routine the local-only read path uses, so byte-identical on shard-size + interval boundaries. * EcVolume::set_shard_locations also advances shard_locations_refresh_time so the staleness check honors callers that populate the cache directly without going through the master LookupEcVolume RPC. * parse_grpc_address moved from grpc_server.rs into grpc_client.rs as `pub` and is reused by both grpc_server.rs and the new store_ec module. Single source of truth for the HTTP↔gRPC port-offset convention. * Reconstruction (recover_one_remote_ec_shard_interval) now seeds bufs from locally-mounted survivor shards BEFORE the remote fan-out. Previously the fan-out was remote-only, so when the shard_locations cache was cold or the master lookup failed, reconstruction errored even though enough siblings were on local disk to recover the missing interval. * review: tighten parse_grpc_address; atomic shard-locations cache swap Two follow-up findings from the PR 9516 review round 2: * `parse_grpc_address` now validates BOTH port components in the dotted form (`host:port.grpcPort`) — previously a non-numeric HTTP port like `host:abc.18080` slipped through and tripped a less-useful downstream URI parse error. The implicit form (`host:port` → port + 10000) also gains an overflow check so inputs like `host:60000` (which silently wrap past u16) are rejected here instead of producing an opaque connection failure later. Six unit tests cover each rejection path. * `EcVolume::set_shard_locations` no longer bumps the per-volume refresh timestamp. The previous fix introduced a freshness race: a multi-shard population that inserts shard-by-shard would flip `needs_refresh == false` on the first write, letting a concurrent reader observe a half-populated map already marked "fresh" and return NotFound for the not-yet-inserted shards. Added `EcVolume::replace_shard_locations(map)` for the atomic bulk swap; `write_back_shard_locations` in the distributed-read path uses it so the cache transitions old → fresh in a single observable step.	2026-05-16 20:44:28 -07:00
Chris Lu	f892b445b3	helm(admin): support secretExtraEnvironmentVars (refs #9511 ) (#9513 ) * helm(admin): support secretExtraEnvironmentVars The admin statefulset only honored extraEnvironmentVars, forcing the OIDC client secret (and any other sensitive WEED_* value) to be inlined as plain text in values.yaml — not GitOps-friendly. The filer chart has had secretExtraEnvironmentVars for this exact case; mirror that pattern on admin so secrets can be projected via valueFrom.secretKeyRef. Surfaced by an enterprise OIDC deployment (issue #9511) where the only workaround was hardcoding WEED_ADMIN_OIDC_CLIENT_SECRET in values.yaml. * helm(admin): sort secretExtraEnvironmentVars keys for stable output Helm/Go template map iteration is non-deterministic, so the env entries could shuffle between renders and trigger spurious StatefulSet rollouts in GitOps tooling (ArgoCD/Flux). Sort the keys with sortAlpha, mirroring the extraEnvironmentVars block immediately above. Flagged by gemini-code-assist and coderabbitai on PR #9513.	2026-05-15 13:19:05 -07:00
Chris Lu	62821964dd	filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509 ) (#9514 ) PR #9442 made the filer refuse to register the IAM gRPC service unless jwt.filer_signing.key was set in security.toml, which broke the admin UI Users/Groups/Policies pages for every deployment that ships without a security.toml — weed mini, plain Helm, vanilla weed filer. The Users tab returns Unimplemented and the page is unusable. Issues #9504, #9505 and #9509 all trace to this gap. The rest of the filer's gRPC surface is unauthenticated by default; treat IAM the same way. The service now always registers, and the auth gate is a no-op when no signing key is configured. When the key is set, every RPC still requires an admin-signed Bearer token, matching the post-#9442 behaviour. Operators who expose the filer gRPC port beyond a trusted network should set the key on both filer and admin. The admin client (IamGrpcStore.withIamClient) already skips attaching the authorization metadata when its key is empty, so no changes there.	2026-05-15 13:15:20 -07:00
Konstantin Lebedev	7d1b16fbcd	fix: ListBucketsHandler for pathStyleDomains (#9510 )	2026-05-15 13:12:55 -07:00
Chris Lu	2ed95d7ea9	helm: decouple JWT signing from cert-manager mTLS (fixes #9506 ) (#9508 ) * helm(security): decouple JWT signing from cert-manager mTLS The filer needs jwt.filer_signing.key to register the IAM gRPC service the Admin UI Users tab calls (PR #9442). The chart only rendered security.toml under enableSecurity, which also pulls in cert-manager for mTLS — much heavier than the Admin UI needs. Operators on Helm without cert-manager have no way to flip the JWT key on, so the Users tab fails with Unimplemented after upgrading past 4.24. Introduce seaweedfs.securityConfigEnabled, true when enableSecurity OR any explicit jwtSigning toggle (volumeRead/filerWrite/filerRead) is set. The configmap renders under that helper; the [grpc.]/[https.] sections inside stay gated on enableSecurity. Each pod template splits the security-config mount onto the helper and keeps the cert volume mounts on enableSecurity. volumeWrite is intentionally excluded from the helper trigger because it defaults to true; including it would silently start mounting security.toml on every fresh install. With this change, enableSecurity=false + defaults renders nothing (unchanged), enableSecurity=true renders the full toml (unchanged), and enableSecurity=false + filerWrite=true renders just the [jwt.] sections so the Admin UI works without mTLS. Fixes #9506. helm(security): trim verbose comments * helm(security): handle null securityConfig in helper Address review feedback: (.Values.global.seaweedfs.securityConfig).jwtSigning errored if a user explicitly set securityConfig: null in their values. Drop into intermediate $sec/$jwt with default dict at each step so a missing or nulled-out parent is tolerated. * helm(ci): cover IAM gRPC decoupling (issue #9506) Five regression assertions exercised against the rendered chart so a future change cannot silently re-couple jwt.filer_signing to mTLS: 1. defaults render no security-config ConfigMap (preserves baseline) 2. filerWrite=true alone renders [jwt.filer_signing] with no [grpc.] 3. filerWrite=true mounts security-config on filer + admin without pulling in cert volumes — the actual fix for the Admin UI Users tab 4. enableSecurity=true still produces the full toml with [grpc.master] 5. securityConfig=null and securityConfig.jwtSigning=null both render cleanly (gemini-code-assist review nit, applied chart-wide) Patch a pre-existing direct-access in filer-statefulset.yaml that crashed on securityConfig=null, surfaced by the new null assertion. helm(ci): drop issue numbers from comments * helm(ci): install pyyaml; assert [jwt.signing] in mTLS path Address coderabbit review: - The new IAM gRPC test block uses `import yaml` but ran before the later `pip install pyyaml -q` step that the security+S3 block performs. CI happens to pass because the runner image carries PyYAML, but make the dependency explicit so a future runner change cannot silently break the regression test. - The enableSecurity=true assertion only checked for [grpc.master]. Also assert [jwt.signing] so a refactor that drops the volume-side JWT stanza from the mTLS path fails the test instead of slipping through.	2026-05-14 23:43:24 -07:00
Chris Lu	bfb2661fec	fix(tests): make 32-bit GOARCH tests build and run (#9507 ) fix(tests): make 32-bit GOARCH tests build and run (#9503) verifyTestFilerClient had bare int64 atomic counters after a map header, so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on linux/386. Switch to atomic.Int64, which the stdlib guarantees is 8-byte aligned on all platforms. rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to t.Errorf, where it default-promoted to int and overflowed 32-bit int. Bind it to a typed uint32 const used in both the comparison and the error message.	2026-05-14 20:55:37 -07:00
Chris Lu	7acba59a5c	4.25 4.25	2026-05-14 12:16:26 -07:00
Chris Lu	2c1482f7a6	fix(ec): clear cross-server stale EC shards before re-distribute (#9478 ) (#9499 ) * fix(ec): clear cross-server stale EC shards before re-distribute (#9478) A previous failed encode leaves partial .ec?? shards mounted on destination volume servers that are not the .dat owner. PR #9480 only prunes when the .dat sits on a sibling disk of the SAME store, so the cross-server case stays stuck: every retry trips volume_grpc_copy.go:570's "ec volume %d is mounted; refusing overwrite" guard and the scheduler loops. Detection already lists existing EC shards as CleanupECShards sources; plumb the shard ids through (ActiveTopology.GetECShardLocations, TaskSourceSpec, TaskSource.shard_ids) and have the EC worker call VolumeEcShardsUnmount + VolumeEcShardsDelete on each destination after the local shard set is generated and before distributeEcShards. Skip EC-shard sources in getReplicas so the post-encode VolumeDelete step does not target destination-only nodes. Integration test mounts a partial shard subset, asserts the mounted-volume refusal, runs cleanupStaleEcShards, and asserts the next ReceiveFile lands. * chore(ec): tighten code comments in stale-shard cleanup Drop issue-number refs from code comments and shorten the docstrings on cleanupStaleEcShards / unmountAndDeleteEcShards / getReplicas plus the new test file. Behavior unchanged. * fix(ec): skip empty-ShardIds locations; dedupe getReplicas by node GetECShardLocations dropped entries where ecShardMatchesCollection saw a phantom info record with EcIndexBits=0 — without ShardIds, getReplicas misread the resulting source as a regular replica and would have called VolumeDelete on a destination-only node. getReplicas now dedupes by Node since VolumeDelete is server-wide; per-disk source rows on the same server collapse to one call. * refactor(ec): use MaxShardCount and ShardBits in collectShardIdsForDisk Drop the literal 32 bit-iteration bound for erasure_coding.MaxShardCount and treat the EcIndexBits union as a ShardBits so Count() drives the slice preallocation. Keeps the helper aligned with the rest of the EC code and survives any future expansion of the shard-count ceiling.	2026-05-14 11:57:45 -07:00
Chris Lu	e56a3ee4a2	ci(s3-lifecycle): split into per-test matrix jobs Each test now runs against a fresh `weed mini`, so per-collection TTL volume budget no longer leaks across tests and exhausts the pool.	2026-05-14 11:47:24 -07:00

1 2 3 4 5 ...

13898 Commits