* fix(shell): volume.balance no longer drains all volumes onto one server
The density-based capacity function reads per-disk VolumeInfos sizes, but
adjustAfterMove only updated VolumeCount and the selectedVolumes map. The
planner re-read a stale topology after every move, so the source node's
density never dropped and it kept moving volumes until that node was empty.
Move the volume's size accounting between disks after each planned move so the
density recomputes and the loop converges to an even distribution.
* refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove
removeVolumeInfo swaps with the last element instead of shifting, and the disk
is fetched by key rather than ranging the DiskInfos map.
* fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads
The periodic flush snapshotted entry.Chunks, then ran CompactFileChunks and
MaybeManifestize (the manifest upload is a network round trip) before
reassigning entry.Chunks. Async uploaders append freshly uploaded chunks
during that window, and the reassignment overwrote them: the data stayed on
the volumes but the file lost those chunk references, leaving zero-filled
holes on read. Large sequential writes such as cat of two 15 GiB files hit
several flush cycles and ended up corrupted.
Snapshot the chunk list under the entry lock with a length marker, do the
slow compaction and manifestization on the snapshot, then splice the
processed prefix back in front of whatever chunks arrived after the
snapshot.
* mount: drop redundant slice copies in the flush splice
processedPrefix is freshly built and the tail sub-slice is consumed
immediately under the entry lock, so append straight onto processedPrefix
instead of allocating two throwaway copies.
* chore(weed/command): prune unused functions
* drop now-unused closed field and renderLocked guard
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* feat(s3): add TagUser, UntagUser, ListUserTags IAM actions
Adds AWS IAM-compatible user tag operations on the embedded IAM
endpoint. Tags persist in the Identity proto as a repeated UserTag
field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits
are enforced. Pagination is stubbed (IsTruncated=false) since the
50-tag cap means all tags fit in a single response.
* review: validate UntagUser TagKeys entries
parseTagKeysParams now rejects empty keys and keys past
MaxUserTagKeyLength; UntagUser additionally requires at least one
TagKeys.member.N entry to match AWS validation behavior.
* review: pre-allocate user-tag merge and filter slices
mergeUserTags now allocates the combined existing+incoming capacity
up front; UntagUser builds the filtered slice via make with the full
ident.Tags capacity instead of ident.Tags[:0:0], which forced a
reallocation on every append.
* review: cover duplicate-in-request and invalid TagKeys cases
Regression tests assert TagUser rejects two members with the same key
in one request, and UntagUser rejects missing/empty/oversized TagKeys
entries.
* feat(s3): add four bucket configuration handlers
- GetBucketPolicyStatus: computes IsPublic from the existing bucket policy
- PutBucketRequestPayment: companion writer to the existing GET; accepts
only BucketOwner
- GetBucketAccelerateConfiguration: returns <Status>Suspended</Status>
- GetBucketLogging: returns an empty BucketLoggingStatus
Lets AWS SDK probes succeed instead of returning MethodNotAllowed.
* review: route GetBucketPolicyStatus through checkBucket
Mirrors the existence/auth gating used by other bucket handlers and
drops the bespoke filer_pb lookup so NoSuchBucket precedence is
consistent across the API surface.
* review: cap PutBucketRequestPayment body with MaxBytesReader
The body is unmarshalled as RequestPaymentConfiguration, which is a
handful of bytes; reject excessively large payloads up front and
defer Close immediately after wrapping.
* review: gate static getters on checkBucket
GetBucketAccelerateConfiguration and GetBucketLogging now run the
standard bucket existence check before returning the static
Suspended / empty-status response so a missing bucket cannot appear
to have valid configuration.
* review: share cache helper across misc tests; check io.ReadAll error
Accelerate and Logging tests now run through newMiscTestServer like
the others so the checkBucket guard sees a cached bucket; the
ReadAll error is explicitly checked.
* feat(s3): stub bucket configuration list endpoints
Adds Get and List handlers for Analytics, Inventory, IntelligentTiering,
and Metrics bucket configurations. List returns an empty result with
IsTruncated=false; single-get returns NoSuchConfiguration so SDK error
parsing remains predictable.
* review: gate stubs on bucket existence
All eight stub handlers now call checkBucket via stubBucketGuard so
NoSuchBucket takes precedence over NoSuchConfiguration / empty-list
responses, matching AWS S3 precedence. Tests provide a cached bucket
so the guard sees it as present.
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap
Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.
The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.
Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.
* s3api: support group inline policies + Condition enforcement
PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.
Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.
Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.
The NotImplemented unit test is inverted to drive the new round-trip.
* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy
DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.
* s3api: scope group-policy resource names per test; tighten deny polling
- Integration test resource names get a per-test suffix so retried or
parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
rather than any non-nil error; transient setup errors no longer end
the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
is nil, matching Put/Get/DeleteGroupPolicy.
* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs
CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers
When a volume server holds EC shards for the same vid across more than
one disk, each DiskLocation registers its own EcVolume entry and
Store.FindEcVolume returns whichever one it hits first. The shard-info
RPC iterated only that single EcVolume's Shards, so the response missed
every shard mounted on a sibling disk.
The worker's verifyEcShardsBeforeDelete sums the per-server responses
into a union bitmap and refuses to delete the source volume when the
union falls short of dataShards+parityShards. On multi-disk
destinations, the union was systematically under-counted and source
deletion got blocked even though all shards were physically present and
mounted.
Walk every DiskLocation in the handler and emit the deduplicated union
of all shards. The .ecx-backed fields (file counts, volume size) still
come from a single EcVolume since every disk's entry opens the same
.ecx via NewEcVolume's cross-disk fallback.
Tests:
- TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in
weed/server/.
- test/volume_server/grpc/ec_verify_multi_disk_test.go integration test
drives the full generate -> mount -> redistribute -> restart ->
reconcile path and asserts both VolumeEcShardsInfo and
VerifyShardsAcrossServers + RequireFullShardSet (the production
source-deletion gate) report all 14 shards.
- ec_multi_disk_lifecycle_test.go tightened: replaces the
"VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a
full-shard-set assertion.
* review: use ShardBits bitmask + cap-pre-allocation for shard dedup
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"
The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".
Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.
ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.
* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal
The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.
Extend the matcher so a request is recognized as S3 Tables when either:
- its AWS V4 credential scope names SERVICE=s3tables; or
- it carries the canonical AWS JSON RPC 1.1 content type and is
unsigned (a request explicitly signed for SERVICE=s3 still wins).
The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).
Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.
* s3api: extend service matcher to all S3 Tables routes; simplify scope check
- Apply serviceMatcher to every S3 Tables route, not just the bare-path
ones. ARN-bearing paths could otherwise be hit by an S3 object key
that starts with arn:aws:s3tables:..., inside a bucket named
"buckets", "namespaces", "tables", or "tag". One matcher everywhere
closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
credential-scope check. The scope shape is fixed at
AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
is exactly s3tables. Existing unit cases (including the
access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
the single Read could return short and make the equality check flaky.
* s3api: drop content-type fallback; sign s3 tables harness traffic instead
The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.
Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.
Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity
verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into
Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0
so the on-disk needle header carries Size=0; TombstoneFileSize is only
the .idx sentinel for "this entry is deleted" and is never written into
a needle header.
ReadBytes' size check therefore mismatched on every tombstone
(-1 != 0), returned ErrorSizeMismatch, and triggered the
4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any
volume large enough that offset+32 GB exceeds dat fileSize the retry
read EOF, CheckVolumeDataIntegrity reported corruption, and the loader
set noWriteOrDelete = true. Every volume whose last 10 .idx entries
included a deletion went read-only on startup — i.e. any healthy
volume where the most recent operations included a delete.
Pass Size(0) so the size check matches the on-disk tombstone header.
Add a regression test that writes three needles, deletes one, and
asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx
tail. Without this fix the test reproduces the exact log shape from
the bug report:
read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF
verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF
The Rust port guards its integrity-check size comparison with
!size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never
hits this path, so no Rust mirror change is needed.
* test(seaweed-volume): mirror Go regression for deletion-tombstone integrity
The Rust integrity check already guards its size-mismatch comparison
with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs
with body_size=0, so the Go regression fixed in the previous commit
does not apply. Lock that guarantee in with a parallel reload test:
write three needles, delete one, sync, reopen via Volume::new, assert
the volume is not flipped read-only.
Catches any future change that removes the deleted-entry guard or
re-introduces a size-strict path in check_volume_data_integrity for
tombstones.
* fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity
CheckVolumeDataIntegrity relies on identity comparison against io.EOF
and ErrorSizeMismatch to walk back through the last ten .idx entries
and tolerate a partial truncation at the tail (the "fix and continue"
loop). The live-needle branch in doCheckAndFixVolumeData already
returns those sentinels unwrapped; the deletion branch wrapped them
in fmt.Errorf, so a genuine .dat truncation past a tombstone offset
broke the recovery and flipped the volume read-only.
Mirror the live-needle handling: both verifyDeletedNeedleIntegrity
and doCheckAndFixVolumeData now short-circuit on io.EOF /
ErrorSizeMismatch and pass them through unwrapped. Other errors keep
their existing context wrapping.
Also tighten the regression test to capture lastAppendAtNs and assert
it's non-zero, so a future regression that skips the tombstone body
(and therefore never populates AppendAtNs) is caught even when the
err check still passes.
fix(s3): keep anonymous access working with EnableIam default
`docker run seaweedfs` (and `weed mini` with no config) start with
EnableIam=true but no IAM config file and no identities. The advanced-IAM
init path was failing in 4.25 because of the missing STS signing key,
which masked a latent bug: SetIAMIntegration unconditionally flipped
isAuthEnabled to true, and isEnabled() also treated a non-nil
iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the
STS fallback started succeeding, the integration got installed end to
end, and every anonymous S3 request bounced as AccessDenied.
Separate the two concerns: SetIAMIntegration just plumbs in the OIDC /
embedded-IAM machinery, and a new EnableAuthEnforcement opts in to
enforcement. The startup path calls it only when -s3.iam.config is
actually provided, so operators with explicit IAM configs still get auth
(preserves #7726). isEnabled() now reads isAuthEnabled only.
* feat(filer): add atime field and TouchAccessTime RPC to filer proto
Introduce POSIX-style access-time tracking on the filer:
- FuseAttributes gains atime (field 22) and atime_ns (field 23).
- New TouchAccessTime RPC (and Touch{Access,Time}{Request,Response})
lets read paths bump atime without going through UpdateEntry's
chunk-rewrite/EqualEntry short-circuit.
Additive proto changes only; zero atime is treated as unset and
existing clients are unaffected. Java client proto is kept in lock
step.
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(filer): wire Atime through Attr codec with mtime fallback
Add Attr.Atime and round-trip it through EntryAttributeToPb /
EntryAttributeToExistingPb / PbToEntryAttribute. A zero proto atime
decodes as Mtime, so legacy entries report a sensible value and
freshly-created/updated entries default Atime to Mtime when callers
do not set it explicitly.
CreateEntry and UpdateEntry stamp Atime = Mtime (or Crtime) when it
is zero. TouchAccessTime later bypasses this path to write atime
alone via Store.UpdateEntry.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(filer): preserve atime in first epoch second on decode
The Atime decode branch previously treated any attr.Atime == 0 as
unset and overwrote it with Mtime, which drops valid timestamps in
the first second of the unix epoch where attr.Atime is 0 but
attr.AtimeNs > 0. Check both fields so we only fall back to Mtime
when both are zero.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Volumes written by versions before 3.09 (commit 056c480eb) store the
needle checksum using the deprecated CRC.Value() transform. When the
read path moved into readNeedleTail, the fallback that accepts both
encodings was dropped, so .dat files copied from old installs now fail
verification with "invalid CRC ... data on disk corrupted" even though
the data is intact. Restore the dual check, matching the surviving
fallback in volume_read.go.
* s3api: verify source permission on CopyObject and UploadPartCopy
The Auth middleware only authorized the destination because routes key on
the request URL. The source from X-Amz-Copy-Source was never evaluated,
so an STS session token scoped to one prefix could copy from any other
prefix in the same bucket.
Add AuthorizeCopySource on IdentityAccessManagement to run the full
bucket-policy + IAM/identity flow against the source, using a synthetic
GetObject request so action resolution lands on s3:GetObject (or
s3:GetObjectVersion when a source versionId is supplied). Both
CopyObjectHandler and CopyObjectPartHandler now invoke it before reading
the source.
* s3api: preserve presigned-URL session token on copy-source check
Presigned CopyObject / UploadPartCopy requests carry the STS session
token in the query string (X-Amz-Security-Token), not in a header.
Rebuilding the synthetic source URL from scratch dropped that token, so
the source authorization would fall through to non-STS paths and miss
session policy enforcement. Forward X-Amz-Security-Token from the
original query (alongside versionId), still excluding unrelated params
like uploadId/partNumber that would steer ResolveS3Action away from
s3:GetObject.
* fix(volume): reopen .idx writable after MarkVolumeWritable
When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a
SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable
only flipped noWriteOrDelete back to false and rewrote .vif, so writes
still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its
writable form (in-memory or leveldb small/medium/large) before flipping
the flag.
Mirror the same fix in seaweed-volume: the Rust load path leaves
CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume
boots read-only, so post-MarkVolumeWritable puts silently succeeded
in-memory only and were lost on the next restart. set_writable now
reattaches an append-mode writer when one is missing.
* fix(volume): keep old needle map until replacement is built; defer writable flag
Go: build the writable needle map into a local before swapping. A
construction failure now leaves v.nm pointing at the original
SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of
stranding the volume with v.nm == nil.
Rust: attach the .idx writer before flipping no_write_or_delete to
false. A transient open/metadata failure used to leave the volume
marked writable with no writer attached, and subsequent puts would
silently skip the on-disk append.
* fix(admin): switch file browser upload/download to filer gRPC + volume HTTP
The admin file browser proxied uploads and downloads through the filer's
HTTP listener, so the whole feature 404'd against filers started with
-disableHttp=true even though S3 still worked on its own port. Re-route
through the filer gRPC service: LookupDirectoryEntry + StreamContent for
reads (chunks flow straight from the volume servers), AssignVolume +
volume HTTP POST + CreateEntry for writes. Volume read tokens come from
jwt.signing.read.key when configured; the old jwt.filer_signing tokens
no longer apply since the filer HTTP surface is bypassed.
* admin file browser: propagate request context + track response writes
Pass r.Context() into uploadFileToFiler so a client disconnect cancels
the in-flight chunked upload instead of letting it run to completion
against the volume servers. For DownloadFile, replace the Content-Type
probe with a small response-writer wrapper that records whether headers
or bytes have actually been sent, so the error path can't silently
convert a pre-stream failure into a partial response if future code
moves the header-setting around.
When s3.domainName is set, all bucket-prefix routes were gated on a
matching Host header. Requests that arrive via an IP, an unlisted
hostname, or a reverse proxy that rewrites Host hit no router and bounce
back as 405/404 (and 503 once a proxy maps the upstream error).
Register the path-style catch-all unconditionally, after the
host-specific routers, so it only fires when no Host matcher applies.
TaskSource and TaskTarget carry disk_id on the wire, but the execution
plan map built for the admin UI dropped the field entirely. On a
multi-disk node holding shards of the same volume, there was no way to
tell from the plan which disk would receive each shard. Include
disk_id on each endpoint and target_disk_id on each shard assignment,
and extend the existing execution-plan test to set and assert the
field.
Updated the README to reflect the current status of MinIO, noting its ceased development and security concerns, along with changes in the descriptions of its features compared to SeaweedFS.
* fix(ec_distribute): remove partial files on copy stream error
writeToFile opens the destination with O_TRUNC and streams into it. On
a mid-stream receive / write / cancellation error it returned the
failure but left the destination behind in whatever state had been
written so far — typically 0 bytes when the source errored before
sending any FileContent. VolumeEcShardsCopy distributes .ecx by
calling doCopyFile, so this same stub-leaving behaviour produced the
0-byte .ecx files seen on EC encoding failures: the source claims a
non-zero ModifiedTsNs (so the existing "source not found" cleanup
doesn't fire), the stream then errors immediately, and the receiver
ends up with a 0-byte .ecx that downstream code mistook for a valid
empty index.
Clean up the partial file on every error path that returns from the
streaming loop (receive, write, and cancellation). Skip cleanup when
isAppend=true so resumable appends keep their existing content. As
defense in depth, VolumeEcShardsCopy also stats the .ecx after copy
and removes / errors on a 0-byte result so the orchestrator can pick
a different source.
The Rust volume server has only the source side of CopyFile (no
client-side stream-to-disk consumer) and no .ecx subsystem yet, so
this fix has no Rust mirror.
* fix(ec_distribute): close file before remove, fail fast on stat error
Address review feedback:
- writeToFile's mid-stream removeIncomplete called os.Remove while the
destination file handle was still open. On Windows os.Remove fails
while a handle is open, so the cleanup wouldn't run there. Wrap the
handle close in a once-only helper, call it from removeIncomplete
and from the existing "source not found" cleanup, and keep a deferred
close as the safety net for the normal-return path.
- VolumeEcShardsCopy's post-copy .ecx check silently passed when
os.Stat returned an error: doCopyFile had reported success but if
the file was already gone, unreadable, or somehow a directory, the
orchestrator only learned at mount time with no useful context.
Treat any non-nil stat error and any directory result as a copy
failure here and surface it immediately.
* fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures
MountEcShards's per-disk loop bailed on the first disk returning a
non-ENOENT error, and NewEcVolume wrapped its ENOENT with %v so the
caller's `err == os.ErrNotExist` check never matched. On a multi-disk
volume server where ec.balance / ec.rebuild had distributed shards
across sibling disks while the matching .ecx never arrived, the mount
loop bailed after disk 0 with "cannot open ec volume index" and the
operator never saw that the rest of the disks were also empty. The
companion failure mode is a 0-byte .ecx stub left by EC distribute's
writeToFile after a mid-stream copy failure: Stat() succeeds, treating
the stub as a valid index, and downstream mount work proceeds against
an empty file.
Wrap the ec-volume open errors with %w, treat a 0-byte .ecx as
os.ErrNotExist (in NewEcVolume, findEcxIdxDirForVolume, and
HasEcxFileOnDisk), and have MountEcShards collect per-disk failures
before returning a single aggregated error. The "no .ecx anywhere"
case gets a distinct error so the orchestrator can re-copy the index
from a healthy replica rather than retry against the same broken
state.
* fix(ec_reconcile): indexEcxOwners also rejects 0-byte .ecx stubs
findEcxIdxDirForVolume already skipped 0-byte .ecx during MountEcShards,
but indexEcxOwners (used by reconcileEcShardsAcrossDisks at startup)
still recorded the first .ecx by name only. On a store where one disk
holds a 0-byte stub left by a failed EC distribute and a sibling disk
holds the real index, the stub would win the owner selection — and
NewEcVolume's new size check would then refuse to load against it,
leaving the orphan shards unloaded even though a valid index exists.
Mirror the size check from findEcxIdxDirForVolume: skip directory
entries whose .ecx Info() reports size 0 or whose Info() call fails.
* fix(ec_mount): accept 0-byte .ecx as valid empty index
The previous commit treated a 0-byte .ecx in NewEcVolume as
os.ErrNotExist, on the assumption that any empty .ecx was a stub left
by a failed copy stream. That broke the legitimate empty-volume case:
when an EC volume's source .idx has no live entries (e.g. all needles
deleted before WriteSortedFileFromIdx), the sorted .ecx is genuinely
0 bytes and must mount. The integration test
TestEcShardsToVolumeMissingShardAndNoLiveEntries fails with
"MountEcShards: no .ecx index found on any local disk" because the
mount path now refuses the legitimate empty index.
A 0-byte .ecx left by a failed copy stream is indistinguishable from
the legitimate empty case by file size alone. Preventing stub files
from being written is the receiver-side cleanup in writeToFile's job
(the companion EC distribute PR), not NewEcVolume's at mount time.
The cross-disk lookup helpers (findEcxIdxDirForVolume, HasEcxFileOnDisk,
indexEcxOwners) keep their size > 0 preference: when a real .ecx
exists on a sibling disk alongside a stub, we still want to route
mounts and reconcile at the real one. If no non-zero .ecx exists
anywhere, the per-disk fallback in MountEcShards can still open the
0-byte .ecx and the volume mounts.
Replace TestMountEcShards_ZeroByteEcxOnlyDisk with
TestMountEcShards_EmptyEcxMountsSuccessfully, which pins the
empty-volume invariant.
* fix(volume.list): show one entry per physical disk on multi-disk nodes
DataNodeInfo.DiskInfos is keyed by disk type, so several same-type
physical disks on one node collapse to a single map entry at the master.
volume.list iterated that map directly and reported one "Disk hdd ...
id:0" line per node, hiding the per-disk volume and shard layout. EC
operators on multi-disk volume servers had no way to verify which
physical disk a shard landed on.
Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk()
method on the proto type so consumers outside admin/topology can use
it. Apply it in writeDataNodeInfo so the verbose Disk block shows one
entry per physical disk, ordered by DiskId. Capacity counters are
split evenly across reconstructed disks since the wire format doesn't
carry per-disk capacity yet.
This is a display-only change. ActiveTopology already did the split on
its own and is now updated to call the shared helper.
* fix(volume.list): preserve totals, count active/remote exactly, dedupe header
Address review feedback on the per-physical-disk split:
- share() truncated remainders so reconstructed per-disk counters could
sum to less than the original aggregate (10 / 3 = 3+3+3). Distribute
the remainder to the lowest disk ids so MaxVolumeCount and
FreeVolumeCount sum exactly back to the node totals.
- ActiveVolumeCount and RemoteVolumeCount are derivable per disk from
the VolumeInfos already grouped by DiskId, so count them exactly
(ReadOnly=false and RemoteStorageName!="" respectively) instead of
approximating with an even split.
- writeDataNodeInfo's per-disk callback fired the DataNode header on
every iteration after the split, so a node with 6 physical disks
emitted 6 DataNode headers. Guard the callback with headerPrinted so
the header still appears at most once per node.
- Sort split disks deterministically using explicit DiskId comparison
to avoid int overflow risk on 32-bit systems.
- Tighten the volume.list test substring to "id:N\n" so unrelated
tokens like "ec volume id:101" don't accidentally match the id:1
needle, and assert the rack callback fires once.
The 127.0.0.1-only reservation in AllocateMiniPorts/AllocatePortSet let
another process hold the gRPC port on a different interface, so weed
mini's isPortAvailable check failed and it shifted master.grpc. weed
shell -master=<HTTP> still derives grpc as HTTP+10000 and dialed the
unused port, hanging until the 30s context deadline killed it. Bind the
reservation listeners on :port to match mini's check.
Also bound listFilerContents in catalog_risingwave with a 30s
exec.CommandContext so a hung weed shell during failure-cleanup can't
burn the 20-minute test budget.
When jwt.filer_signing.key is set, the filer's IamGrpcServer requires
a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed
without that header and failed with Unauthenticated. Route them through
a small helper that mints a token from the same key viper-loaded from
security.toml and appends it as outgoing metadata, matching the credential
grpc_store pattern.
When weed filer started its embedded S3 gateway with -s3 -s3.config, only
the S3 server loaded the s3.json static identities — the filer's own
CredentialManager stayed empty, so the IAM gRPC service backing the admin
UI and weed shell returned only dynamic users. Mirror the wiring weed
server already does and hand the same config path to the filer.
fix(mount): don't release file handles from Forget
Forget(nodeid, nlookup) only decrements the kernel inode lookup count.
File handle lifecycle belongs to FUSE Open/Release. Driving the FH
refcount from Forget coupled two unrelated counters and could tear down
a still-live handle if Forget ever raced ahead of Release.
Drop the ReleaseByInode call (and the now-unused method).
* fix(iceberg): dial filer gRPC address verbatim in plugin worker
dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress,
whose single-port fallback adds +10000 to any host:port — so when the admin
forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker,
the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent
28888 and dispatched jobs fail with connection refused.
Drop the conversion; the address is already dialable. Tests that produced fake
filer addresses in dual-port form now return host:grpcPort to match the new
contract.
* test(ec): use renamed detection_interval_minutes field
The admin_runtime.detection_interval_seconds field was renamed to
detection_interval_minutes back in May. This integration test was not
updated, so the unknown JSON field was silently ignored and the scheduler
fell back to the default detection interval (17 min for erasure_coding),
which exceeds the test's 5-minute wait and times out.
Switch to detection_interval_minutes: 1 — local run completes in ~120s.
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup
In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.
This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).
The same-disk invariant the EC lifecycle (encode / decode / balance /
vacuum / repair) was already documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.
Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).
* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check
hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.
Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.
* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup
Port of the Go fix (commit 088e26ea6) to the Rust volume server.
Adds Store::mirror_ec_metadata_to_shard_disks, called from
add_location / load_new_volumes before the cross-disk orphan
reconciler. Physically copies .ecx / .ecj / .vif from the disk that
owns the index files onto every disk holding shards but missing
sidecars, so each shard-bearing disk ends up self-contained.
The reconciler now prefers the local idx_directory when the mirror
has installed a .ecx there; the cross-disk virtual mount remains as
the fallback for volumes whose mirror failed (read-only target, out
of space, partial copy on a previous boot). Adds ec_local_ecx_path
helper shared between reconcile and mirror to detect the post-mirror
fast path.
Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when
-dir.idx is configured and the destination still has a legacy .ecx
in its data dir, that's recognized so the mirror does not write a
duplicate copy into idx_directory.
Tests cover the two key cases: orphan layout (dir0 receives the
sidecars from dir1) and idempotency (a pre-existing destination .ecx
is not overwritten).
* trim verbose comments on EC mirror code
Comments now lead with the WHY (non-obvious constraints, the
post-mirror fast path, why local copies are authoritative) and drop
restate-the-code blocks, headers, and section dividers. Behavior is
unchanged; all existing tests still pass on both the Go volume
server and the seaweed-volume Rust port.
* drop github issue refs from added comments
Two stray "#9212" references slipped into comments I added on the
cross-disk reconciler call site. The git log carries the issue
history; comments stand on their own.
* test(ec): accept rebuild on either disk after sidecar mirror
TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must
land at the disk-0 path. With the boot-time sidecar mirror, every
shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now
picks whichever disk hosts the most shards — disk 1 in this layout
after the deletion. The shard can legitimately rebuild on either
disk; the test now accepts both and uses the chosen path for the
subsequent mount + read verification.
* mini: quieter startup with a docker-compose-style progress board
Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.
Also folds in a few cleanups uncovered while making this readable:
- route the admin.go startup prints through glog so quietMiniLogs()
filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
(new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
s3 / iceberg / admin
* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors
loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.
startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().
* ci(s3-proxy-signature): match weed mini's new progress-board ready line
The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits " S3 ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.
Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
* iamapi: route managed policies through credential manager (fixes#9518)
CreatePolicy via the IAM API wrote straight to the filer
/etc/iam/policies.json, ignoring any non-filer credential store. When
credential.postgres was configured, policies created via the IAM API
landed only in the filer while the Admin UI wrote to postgres,
producing a split-brain where ListPolicies/GetPolicy never saw the
Admin UI's policies and vice versa.
GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies
from credentialManager and persist Create/Update/Delete as a delta
against the store. Inline user/group policies still live in the legacy
policies.json file (no credential-store API for them yet). Pre-existing
managed policies in the legacy file are merged on read so deployments
don't lose data, and re-persisted to the store on the next write so
the legacy file is drained over time.
* credential: route IAM API inline policies through credential manager
Extends the #9518 fix to user-inline and group-inline policies so the
IAM API never writes the legacy /etc/iam/policies.json bundle directly.
The previous patch only routed managed policies; this one finishes the
job for the other two policy types.
- Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional
interfaces, mirroring the existing user-inline ones, and matching
Put/Get/Delete/List/LoadAll wrappers on CredentialManager.
- Implement group-inline storage in memory (new map), filer_etc (new
field on PoliciesCollection, reusing the legacy file under policyMu),
and postgres (new group_inline_policies table with ON DELETE CASCADE
off the groups FK).
- Wire the new methods through PropagatingCredentialStore so wrapped
stores still delegate correctly.
- IamS3ApiConfigure.PutPolicies now applies managed + user-inline +
group-inline as deltas through the credential manager; the legacy
/etc/iam/policies.json file is never written when a credential
manager is wired up. GetPolicies still reads the legacy bundle once
as a fallback so unmigrated data is picked up and re-persisted into
the store on the next write.
* credential: propagate SaveConfiguration writes to running S3 caches
Postgres (and any non-filer) credential stores never fired the S3 IAM
cache invalidation path on bulk identity / group updates. The
PropagatingCredentialStore had explicit Put/Remove handlers for
single-entity calls (CreateUser, PutPolicy, etc.) but inherited
SaveConfiguration unchanged from the embedded store, so the bulk path
the IAM API takes at the end of every handler was silent. Inline-policy
changes recompute identity.Actions and persist via SaveConfiguration,
so until restart the cached Actions on each S3 server stayed stale and
authorization decisions used the pre-change view.
Override SaveConfiguration to snapshot the prior user / group lists,
delegate the save, then fan out PutIdentity / PutGroup for what's in
the new config and RemoveIdentity / RemoveGroup for what got pruned.
Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes.
* iamapi: drain legacy policies.json after authoritative credential-store writes
Review pointed out a resurrection bug: GetPolicies still reads
/etc/iam/policies.json as a one-way migration fallback, but PutPolicies
in the credential-manager path never wrote that file, so legacy-only
entries reappeared on the next read even after the IAM API "deleted"
them. PutPolicies now overwrites the bundle with an empty {} after a
successful credential-store write, unless the store is filer_etc
(which owns the bundle as its own inline-policy backing — clearing it
would wipe filer_etc's data). Also wraps the filer read, JSON
unmarshal, and marshal errors with context per the other review
comments.
The master returns each registered filer in pb.ServerAddress dual-port
form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's
plugin context builder forwarded that string verbatim as
filer_grpc_address, so workers calling grpc.DialContext on it failed
every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port".
Run each entry through pb.ServerAddress.ToGrpcAddress before populating
ClusterContext.FilerGrpcAddresses.
The lifecycle integration test now pins filer.port.grpc to a value that
breaks the FILER_PORT+10000 assumption, and a new dispatch test drives
the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and
asserts the dispatched job both reaches the filer and deletes the
backdated object.
* fix(ec): blanket-clean every destination over the full shard range
The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.
Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.
Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.
* log(ec): include disk_id in EC unmount/delete/refusal log lines
The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:
UnmountEcShards 4121.1 -> add disk_id
ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id
volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids
ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.
Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).
* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server
The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
* fix(ec): mount falls back to sibling-disk .ecx (fixes#9519)
MountEcShards iterated DiskLocations and on each disk called LoadEcShard
with that disk's IdxDirectory as the .ecx home. When ec.balance lands the
.ec?? shard on disk A but the .ecx on sibling disk B of the same volume
server, NewEcVolume ENOENTs the .ecx and returns "cannot open ec volume
index ...". That error is not os.ErrNotExist, so the per-disk continue
branch did not engage and the mount loop bailed before trying any other
disk.
The startup reconciliation in reconcileEcShardsAcrossDisks already
handles this layout for orphan shards discovered on boot (issue #9212).
This change mirrors the same primitive on the mount path: look up the
.ecx owner across all DiskLocations once and route NewEcVolume at that
directory whenever the disk being mounted does not own its own copy of
the .ecx. Same-disk mounts are unaffected because HasEcxFileOnDisk keeps
LocalIdxDirectory in play.
Adds a regression test that plants the index files on a sibling disk
AFTER NewStore returns (so the startup reconcile is a no-op for that
vid) and verifies MountEcShards succeeds; also pins the same-disk
baseline against accidental re-routing.
* fix(ec): skip redundant stats in cross-disk .ecx lookup (review)
Two follow-ups from gemini-code-assist on #9521:
1. MountEcShards: when findEcxIdxDirForVolume already returned a path
that lives on this disk's IdxDirectory or Directory, the disk owns
the .ecx — skip the HasEcxFileOnDisk stat and use the local idx dir
directly. Only re-check when the disk's directories are neither, so
the duplicate-.ecx-on-multiple-disks edge case is still honored.
2. findEcxIdxDirForVolume: hoist the seen map across the location loop
so a shared IdxDirectory (one -dir.idx paired with several -dir
entries) is only stat'd once per call.
Both are I/O optimizations; behavior is unchanged. Existing
cross-disk and same-disk regression tests still pass.
* docs(ec): drop issue/PR references from cross-disk mount comments
Comments and test docstrings stand on their own; the issue number
adds nothing a reader can act on and goes stale across forks. Keep
the description of *what* the layout is and *why* the fallback
exists, just without the reference.
* feat(seaweed-volume): distributed EC read across peer servers
EcVolume::read_ec_shard_needle previously errored with NotFound when
any interval's shard wasn't local. In an RS(10,4)-across-N deployment
each server holds one shard, so every read needed >=9 peer fetches and
post-EC GETs returned 404 on volumes whose shards lived on more than
one server.
Mirror of weed/storage/store_ec.go's readOneEcShardInterval ->
readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain:
* server/store_ec.rs (new): entry point
read_ec_shard_needle_distributed. Snapshots locate-needle + local
reads under the Store sync lock, drops the lock, then async-fetches
missing intervals via the peer's VolumeEcShardRead RPC. Falls back
to Reed-Solomon reconstruction (read every other shard at the same
(shard_offset, size) and run rs.reconstruct) when the direct peer
read fails. Refreshes the per-EcVolume shard_locations cache from
the master's LookupEcVolume RPC using Go's freshness thresholds
(11s / 7min / 37min).
* erasure_coding/ec_volume.rs: shard_locations now sits behind a
std::sync::RwLock so the read path can refresh the map without
holding the Store write lock. Adds shard_locations_refresh_time
(Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's
ShardLocationsLock / ShardLocationsRefreshTime fields. set/get
helpers updated for interior mutability.
* server/handlers.rs: GET handler now tries the local-only fast
path first, then falls through to the distributed path on
NotFound.
* review: address PR 9516 feedback on distributed EC read
Five of the six PR-review comments addressed; the sixth (JWT on
outgoing peer gRPC) is deferred with an explicit TODO because the
crate-wide outgoing-JWT signing surface doesn't exist yet — adding it
in this one call site would split the credential plumbing across
peer paths that already lack it (copy_file_from_source, batch_delete,
…). Revisit when an outgoing-JWT helper lands.
Fixed in this commit:
* Handlers: drop the two-tier (local-first, then distributed)
read in handlers.rs. read_ec_shard_needle_distributed already
does the local-first pass under the same store read lock; the
redundant outer attempt re-read local intervals twice for any
needle that spanned mixed-locality shards.
* Scanner snapshot: replace inline locate-needle math with
`ecv.locate_needle(needle_id)`. Same routine the local-only
read path uses, so byte-identical on shard-size + interval
boundaries.
* EcVolume::set_shard_locations also advances
shard_locations_refresh_time so the staleness check honors
callers that populate the cache directly without going through
the master LookupEcVolume RPC.
* parse_grpc_address moved from grpc_server.rs into
grpc_client.rs as `pub` and is reused by both grpc_server.rs
and the new store_ec module. Single source of truth for the
HTTP↔gRPC port-offset convention.
* Reconstruction (recover_one_remote_ec_shard_interval) now seeds
bufs from locally-mounted survivor shards BEFORE the remote
fan-out. Previously the fan-out was remote-only, so when the
shard_locations cache was cold or the master lookup failed,
reconstruction errored even though enough siblings were on
local disk to recover the missing interval.
* review: tighten parse_grpc_address; atomic shard-locations cache swap
Two follow-up findings from the PR 9516 review round 2:
* `parse_grpc_address` now validates BOTH port components in the
dotted form (`host:port.grpcPort`) — previously a non-numeric
HTTP port like `host:abc.18080` slipped through and tripped a
less-useful downstream URI parse error. The implicit form
(`host:port` → port + 10000) also gains an overflow check so
inputs like `host:60000` (which silently wrap past u16) are
rejected here instead of producing an opaque connection
failure later. Six unit tests cover each rejection path.
* `EcVolume::set_shard_locations` no longer bumps the per-volume
refresh timestamp. The previous fix introduced a freshness
race: a multi-shard population that inserts shard-by-shard
would flip `needs_refresh == false` on the first write, letting
a concurrent reader observe a half-populated map already
marked "fresh" and return NotFound for the not-yet-inserted
shards. Added `EcVolume::replace_shard_locations(map)` for the
atomic bulk swap; `write_back_shard_locations` in the
distributed-read path uses it so the cache transitions
old → fresh in a single observable step.
* helm(admin): support secretExtraEnvironmentVars
The admin statefulset only honored extraEnvironmentVars, forcing the
OIDC client secret (and any other sensitive WEED_* value) to be inlined
as plain text in values.yaml — not GitOps-friendly. The filer chart has
had secretExtraEnvironmentVars for this exact case; mirror that pattern
on admin so secrets can be projected via valueFrom.secretKeyRef.
Surfaced by an enterprise OIDC deployment (issue #9511) where the only
workaround was hardcoding WEED_ADMIN_OIDC_CLIENT_SECRET in values.yaml.
* helm(admin): sort secretExtraEnvironmentVars keys for stable output
Helm/Go template map iteration is non-deterministic, so the env entries
could shuffle between renders and trigger spurious StatefulSet rollouts
in GitOps tooling (ArgoCD/Flux). Sort the keys with sortAlpha, mirroring
the extraEnvironmentVars block immediately above.
Flagged by gemini-code-assist and coderabbitai on PR #9513.
PR #9442 made the filer refuse to register the IAM gRPC service unless
jwt.filer_signing.key was set in security.toml, which broke the admin
UI Users/Groups/Policies pages for every deployment that ships without
a security.toml — weed mini, plain Helm, vanilla weed filer. The Users
tab returns Unimplemented and the page is unusable. Issues #9504,
#9505 and #9509 all trace to this gap.
The rest of the filer's gRPC surface is unauthenticated by default;
treat IAM the same way. The service now always registers, and the
auth gate is a no-op when no signing key is configured. When the key
is set, every RPC still requires an admin-signed Bearer token, matching
the post-#9442 behaviour. Operators who expose the filer gRPC port
beyond a trusted network should set the key on both filer and admin.
The admin client (IamGrpcStore.withIamClient) already skips attaching
the authorization metadata when its key is empty, so no changes there.
* helm(security): decouple JWT signing from cert-manager mTLS
The filer needs jwt.filer_signing.key to register the IAM gRPC service the
Admin UI Users tab calls (PR #9442). The chart only rendered security.toml
under enableSecurity, which also pulls in cert-manager for mTLS — much heavier
than the Admin UI needs. Operators on Helm without cert-manager have no way
to flip the JWT key on, so the Users tab fails with Unimplemented after
upgrading past 4.24.
Introduce seaweedfs.securityConfigEnabled, true when enableSecurity OR any
explicit jwtSigning toggle (volumeRead/filerWrite/filerRead) is set. The
configmap renders under that helper; the [grpc.*]/[https.*] sections inside
stay gated on enableSecurity. Each pod template splits the security-config
mount onto the helper and keeps the cert volume mounts on enableSecurity.
volumeWrite is intentionally excluded from the helper trigger because it
defaults to true; including it would silently start mounting security.toml on
every fresh install. With this change, enableSecurity=false + defaults
renders nothing (unchanged), enableSecurity=true renders the full toml
(unchanged), and enableSecurity=false + filerWrite=true renders just the
[jwt.*] sections so the Admin UI works without mTLS.
Fixes#9506.
* helm(security): trim verbose comments
* helm(security): handle null securityConfig in helper
Address review feedback: (.Values.global.seaweedfs.securityConfig).jwtSigning
errored if a user explicitly set securityConfig: null in their values. Drop
into intermediate $sec/$jwt with default dict at each step so a missing or
nulled-out parent is tolerated.
* helm(ci): cover IAM gRPC decoupling (issue #9506)
Five regression assertions exercised against the rendered chart so a
future change cannot silently re-couple jwt.filer_signing to mTLS:
1. defaults render no security-config ConfigMap (preserves baseline)
2. filerWrite=true alone renders [jwt.filer_signing] with no [grpc.*]
3. filerWrite=true mounts security-config on filer + admin without
pulling in cert volumes — the actual fix for the Admin UI Users tab
4. enableSecurity=true still produces the full toml with [grpc.master]
5. securityConfig=null and securityConfig.jwtSigning=null both render
cleanly (gemini-code-assist review nit, applied chart-wide)
Patch a pre-existing direct-access in filer-statefulset.yaml that
crashed on securityConfig=null, surfaced by the new null assertion.
* helm(ci): drop issue numbers from comments
* helm(ci): install pyyaml; assert [jwt.signing] in mTLS path
Address coderabbit review:
- The new IAM gRPC test block uses `import yaml` but ran before the
later `pip install pyyaml -q` step that the security+S3 block
performs. CI happens to pass because the runner image carries
PyYAML, but make the dependency explicit so a future runner change
cannot silently break the regression test.
- The enableSecurity=true assertion only checked for [grpc.master].
Also assert [jwt.signing] so a refactor that drops the volume-side
JWT stanza from the mTLS path fails the test instead of slipping
through.
fix(tests): make 32-bit GOARCH tests build and run (#9503)
verifyTestFilerClient had bare int64 atomic counters after a map header,
so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on
linux/386. Switch to atomic.Int64, which the stdlib guarantees is
8-byte aligned on all platforms.
rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to
t.Errorf, where it default-promoted to int and overflowed 32-bit int.
Bind it to a typed uint32 const used in both the comparison and the
error message.
* fix(ec): clear cross-server stale EC shards before re-distribute (#9478)
A previous failed encode leaves partial .ec?? shards mounted on
destination volume servers that are not the .dat owner. PR #9480 only
prunes when the .dat sits on a sibling disk of the SAME store, so the
cross-server case stays stuck: every retry trips
volume_grpc_copy.go:570's "ec volume %d is mounted; refusing overwrite"
guard and the scheduler loops.
Detection already lists existing EC shards as CleanupECShards sources;
plumb the shard ids through (ActiveTopology.GetECShardLocations,
TaskSourceSpec, TaskSource.shard_ids) and have the EC worker call
VolumeEcShardsUnmount + VolumeEcShardsDelete on each destination after
the local shard set is generated and before distributeEcShards. Skip
EC-shard sources in getReplicas so the post-encode VolumeDelete step
does not target destination-only nodes.
Integration test mounts a partial shard subset, asserts the
mounted-volume refusal, runs cleanupStaleEcShards, and asserts the
next ReceiveFile lands.
* chore(ec): tighten code comments in stale-shard cleanup
Drop issue-number refs from code comments and shorten the docstrings
on cleanupStaleEcShards / unmountAndDeleteEcShards / getReplicas plus
the new test file. Behavior unchanged.
* fix(ec): skip empty-ShardIds locations; dedupe getReplicas by node
GetECShardLocations dropped entries where ecShardMatchesCollection saw a
phantom info record with EcIndexBits=0 — without ShardIds, getReplicas
misread the resulting source as a regular replica and would have called
VolumeDelete on a destination-only node.
getReplicas now dedupes by Node since VolumeDelete is server-wide;
per-disk source rows on the same server collapse to one call.
* refactor(ec): use MaxShardCount and ShardBits in collectShardIdsForDisk
Drop the literal 32 bit-iteration bound for erasure_coding.MaxShardCount
and treat the EcIndexBits union as a ShardBits so Count() drives the
slice preallocation. Keeps the helper aligned with the rest of the EC
code and survives any future expansion of the shard-count ceiling.