14047 Commits

Author SHA1 Message Date
Chris Lu 2a46d457ac 4.31 4.31 2026-06-01 23:32:04 -07:00
Chris Lu e264e9883e fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM under load (#9780)
* fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM

The Rust volume server buffered the entire upload body with
to_bytes(usize::MAX) and only checked the file-size limit afterward, so a
single large upload — or many concurrent uploads, since the in-flight byte
throttle defaults to 0 (unlimited) — could exhaust memory and get the process
OOM-killed under load. The read path had two more single-request OOM vectors:
`vec![0u8; manifest.size]` allocated from an attacker-controlled chunk-manifest
size, and gzip decompression was unbounded (gzip bomb).

- Bound the upload body read by file_size_limit_bytes (plus a margin for
  multipart framing), mirroring Go's io.LimitReader(sizeLimit+1), and reject
  oversize before the whole body is buffered.
- Validate manifest.size (reject negative / oversized) before allocating.
- Cap gzip output in maybe_decompress_gzip and route the inline GzDecoder sites
  through it.
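
The io.LimitReader(sizeLimit+1) pattern named in the first bullet can be sketched in Go as below. This is an illustration, not the volume server's code: readBounded, readBoundedString, and errTooLarge are hypothetical names, and the multipart framing margin is left out.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errTooLarge is a hypothetical sentinel for an oversized body.
var errTooLarge = errors.New("body exceeds size limit")

// readBounded buffers at most limit+1 bytes from r. Reading one byte past the
// limit is enough to tell "exactly at the limit" from "over the limit" without
// ever holding the whole oversized body in memory.
func readBounded(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// readBoundedString is a small helper so the demo can pass string bodies.
func readBoundedString(body string, limit int64) (string, error) {
	data, err := readBounded(strings.NewReader(body), limit)
	return string(data), err
}

func main() {
	_, err := readBoundedString("0123456789", 4)
	fmt.Println("rejected as too large:", errors.Is(err, errTooLarge))
}
```

The caller never allocates more than limit+1 bytes per request, so the worst case scales with the configured limit rather than with what the client sends.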

* fix(seaweed-volume): address review - chunk offset, 32-bit cast, decompress errors

- Validate chunk.offset before indexing in chunk-manifest expansion: a negative
  offset wrapped to a huge usize and underflowed `end - offset` (panic from a
  crafted manifest). Reject negative, skip out-of-range, use saturating math.
- Use usize::try_from for the upload body limit instead of `as usize`, so a
  >usize::MAX file_size_limit on 32-bit caps at usize::MAX rather than silently
  truncating to a tiny value.
- maybe_decompress_gzip now returns Result<_, GunzipError> distinguishing a
  decode failure (callers fall back to raw bytes, as before) from hitting the
  size cap (TooLarge), which now returns 413 instead of silently serving the
  still-compressed bytes.
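
A Go analogue of the last bullet, for illustration only (the change itself is in the Rust server): the decompressor reports a decode failure and a cap hit as different errors, so a caller can fall back to raw bytes for the first and answer 413 for the second. gunzipCapped, gzipBytes, and errTooLarge are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// errTooLarge marks output that would exceed the cap (hypothetical name).
var errTooLarge = errors.New("decompressed size exceeds cap")

// gzipBytes compresses data; it exists only to feed the demo.
func gzipBytes(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

// gunzipCapped inflates data but never buffers more than maxOut+1 bytes.
// Any other failure is returned as a wrapped decode error.
func gunzipCapped(data []byte, maxOut int64) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("gzip decode: %w", err)
	}
	defer zr.Close()
	out, err := io.ReadAll(io.LimitReader(zr, maxOut+1))
	if err != nil {
		return nil, fmt.Errorf("gzip decode: %w", err)
	}
	if int64(len(out)) > maxOut {
		return nil, errTooLarge
	}
	return out, nil
}

func main() {
	bomb := gzipBytes(bytes.Repeat([]byte("a"), 1000))
	_, err := gunzipCapped(bomb, 100)
	fmt.Println("cap hit:", errors.Is(err, errTooLarge))

	_, err = gunzipCapped([]byte("not gzip"), 100)
	fmt.Println("decode failure:", err != nil && !errors.Is(err, errTooLarge))
}
```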

* fix(seaweed-volume): inflate manifest chunks into the result window to cap peak memory

The chunk-manifest expansion still doubled memory: `result` was already allocated
at manifest.size (<=2 GiB) and each compressed chunk was inflated into a separate
Vec (also up to 2 GiB), so a single request could peak near 4 GiB.

Decompress compressed chunks directly into their result[offset..] window (bounded
by the remaining space) so a chunk never allocates a second large buffer; peak
stays at ~manifest.size. Bytes past the window are dropped (matching the prior
truncation), and a fully-undecodable chunk still falls back to its raw bytes.

* fix(seaweed-volume): fall back to raw chunk bytes on any decode failure

Per review: the gzip fallback must run on any decode error, not only when no
bytes were decoded. Clear the partially-written output and copy the chunk's raw
bytes (truncated to the window), restoring the prior decode-failure behavior.
2026-06-01 22:24:13 -07:00
Chris Lu fba71ab14c ci: parallelize the unified release-container build (#9783)
* docker: cross-compile the Go binary instead of emulating it under QEMU

The builder stage ran as the target platform, so arm64/arm/386 images
emulated the whole Go compile (and the full git clone) under QEMU. The
binary is CGO-free, so pin the builder to $BUILDPLATFORM and cross-compile
with GOOS/GOARCH (GOARM for v7), keeping every target's compile native.

* ci: build all release container variants in parallel

The build matrix throttled to two variants at a time on a stale rate-limit
worry. Pulls go through mirror.gcr.io and pushes target GHCR only, so the
five variants can all build at once.

* ci: copy each variant to Docker Hub from its build job

The separate copy-to-dockerhub job waited on the whole build matrix before
any GHCR -> Docker Hub copy could start. Move the crane copy into the build
job so each variant copies as soon as it is built, overlapping with the
others still compiling. tag-latest and helm-release now depend on build.
2026-06-01 20:34:05 -07:00
Neetika Mittal 45465e5a05 fix(master): notify clients after manual volume grow (#9656)
Co-authored-by: Neetika Mittal <mneetika@users.noreply.github.com>
2026-06-01 20:33:37 -07:00
Chris Lu bf37fba0e1 fix(s3): recover versioned reads when the .versions latest pointer is absent (#9782)
GetObject on a versioned object returned NoSuchKey forever when the
.versions directory existed but carried no latest-version pointer (empty
Extended metadata) while real version files remained inside it. The
self-heal path only fired for a dangling pointer (present but referencing
a missing file), not an absent one, so doGetLatestObjectVersion fell
straight through and errored on every read.

- doGetLatestObjectVersion now calls recoverLatestVersionWithoutPointer
  when the pointer is missing or empty. An absent pointer is the legitimate
  signal that a pre-versioning or suspended-versioning "null" object is
  current, so that object wins; only when it is absent do we rescan
  .versions/ and rebuild the pointer from the version files present.
  Transient rescan failures propagate instead of being masked as NotFound.
- selectLatestVersion derives the version id from the v_<versionId> file
  name when the Seaweed-X-Amz-Version-Id attribute is absent, so version
  files written outside the normal versioned-PUT path (replicated or
  restored entries) are still promotable. The orphan diagnostic uses the
  same detection so an entry can't be both promoted and counted as an orphan.
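
The file-name fallback in the second bullet could look roughly like this. It is a sketch under assumptions: versionID and versionFilePrefix are hypothetical names, and the attribute is passed in as a plain string instead of being read from the entry's Extended metadata.

```go
package main

import (
	"fmt"
	"strings"
)

// versionFilePrefix follows the v_<versionId> naming described above.
const versionFilePrefix = "v_"

// versionID prefers the stored attribute and falls back to the file name.
// The bool is false when neither source yields an id.
func versionID(attr, fileName string) (string, bool) {
	if attr != "" {
		return attr, true
	}
	if strings.HasPrefix(fileName, versionFilePrefix) && len(fileName) > len(versionFilePrefix) {
		return fileName[len(versionFilePrefix):], true
	}
	return "", false
}

func main() {
	id, ok := versionID("", "v_3f9a")
	fmt.Println(id, ok)
}
```

Using one function for both promotion and the orphan diagnostic is what keeps the two from disagreeing about the same entry.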
2026-06-01 20:01:30 -07:00
Chris Lu ca81c0c525 fix(ec): pass per-volume data-shard count to the parity-shard split (#9781)
* fix(ec): pass per-volume data-shard count to the parity-shard split

ShardsInfo.DeleteParityShards/MinusParityShards looped ids 10..13, assuming
the fixed 10+4 layout. For a non-default ratio this splits data and parity
incorrectly: a wide ratio (12+4, 16+6) drops real data ids >= 10, which breaks
ec.decode. They now take a dataShards argument (<= 0 falls back to
DataShardsCount) and clear ids dataShards..MaxShardCount. ec.decode threads
the data-shard count from collectEcNodeShardsInfo to both split call sites,
and admin LogicalSize passes DataShardsCount.

Also: EC cleanup now sets an explicit per-disk storage impact
(-len(ShardIds)) instead of falling back to the TotalShardsCount constant,
so freed-capacity accounting matches the shards actually removed.

OSS is always 10+4, so behavior is unchanged here; this keeps the split
ratio-correct and the API aligned with the enterprise per-volume override.
Adds parity-split ratio tests.

* ec: clear parity shards in one locked pass

Address review: DeleteParityShards looped si.Delete, taking the lock once per
id. shards is sorted by Id and shardBits is a bitmap, so mask off the high
bits and truncate the sorted slice at the first parity id (binary search) under
a single lock. Preserves the dataShards<=0 -> DataShardsCount default.
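
A minimal sketch of the single-pass clear, assuming a uint32 bitmap and a sorted slice of plain int ids. shardSet, newFullSet, and deleteParity are hypothetical stand-ins for ShardsInfo and its methods, not the real types.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

const defaultDataShards = 10 // the standard 10+4 layout

// shardSet is a simplified stand-in: a bitmap plus the same ids kept sorted.
type shardSet struct {
	mu   sync.Mutex
	bits uint32
	ids  []int // sorted ascending
}

// newFullSet returns a set holding shard ids 0..total-1.
func newFullSet(total int) *shardSet {
	s := &shardSet{}
	for id := 0; id < total; id++ {
		s.bits |= 1 << uint(id)
		s.ids = append(s.ids, id)
	}
	return s
}

// deleteParity drops every id >= dataShards under one lock acquisition:
// a mask clears the high bits and a binary search finds the cut point.
func (s *shardSet) deleteParity(dataShards int) {
	if dataShards <= 0 {
		dataShards = defaultDataShards
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bits &= uint32(1)<<uint(dataShards) - 1
	s.ids = s.ids[:sort.SearchInts(s.ids, dataShards)]
}

func main() {
	wide := newFullSet(16 + 4)
	wide.deleteParity(16)
	fmt.Printf("%#x %d\n", wide.bits, len(wide.ids))
}
```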
2026-06-01 19:25:15 -07:00
Chris Lu f410d975c7 fix(ec): resolve EC data-shard count from the volume's .vif on reboot (#9779)
* fix(ec): resolve EC data-shard count from the volume's .vif on reboot

A volume server never loads a cluster EC config into memory, so startup
decisions that assumed 10 data shards mishandled volumes whose .vif
records a different ratio:

- validateEcVolume sized the expected shard against 10 data shards and
  required >=10 local shards, so a volume with a non-default ratio and a
  coexisting .dat could be wiped on reboot. Read the ratio from the .vif.
- pruneIncompleteEcWithSiblingDat used the hardcoded 10-shard threshold,
  so a full data set for a non-default ratio with a healthy sibling .dat
  was wiped as a partial leftover. Use the EcVolume's .vif-derived ratio.

Behavior is unchanged for the standard 10+4 layout (the .vif resolves to
10). Adds storage-level reboot tests.

* ec: avoid per-call allocations in ecDataShardsFromVif

Address review: the helper runs once per EC volume at startup. Replace the
slice+map dedup of the two dirs with direct conditional checks via a small
ecDataShardsFromVifDir helper, eliminating the heap allocations and GC
pressure when loading many volumes.
2026-06-01 19:22:14 -07:00
steve.wei 1313600b9e fix(topology): restore active count after vacuum recovery (#9770) 2026-06-01 15:23:22 -07:00
Chris Lu 2386fa550a grpc: don't tear down the shared master connection on a caller's own timeout (#9775)
A Canceled/DeadlineExceeded from the caller's per-request context was
treated like a dead channel: it closed the shared cached ClientConn and
cancelled every other in-flight RPC on it with "the client connection is
closing". Under a burst of concurrent chunk assigns (e.g. a large S3
multipart upload) one slow assign hitting its 10s attempt timeout could
poison the connection for all the rest, cascading into a flood of 500s.

Thread the caller's context into shouldInvalidateConnection and only
invalidate on Canceled/DeadlineExceeded while that context is still live,
which isolates the genuine stale-channel signal (a peer restart behind a
k8s Service VIP). To carry the context, add a ctx parameter to the
existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the
master assign and volume-lookup paths pass their per-attempt context and
every other caller passes context.Background().
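
The decision rule can be sketched with the standard library alone. This is an assumption-laden illustration: the real check inspects gRPC status codes, while the sketch uses context.Canceled and context.DeadlineExceeded as stand-ins, and shouldInvalidate and demoOutcomes are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// shouldInvalidate reports whether a failed call should close the shared
// connection. A cancel or deadline error only counts as a stale-channel signal
// when the caller's own context is still live; if the caller's context has
// ended, the error is the caller's timeout and the connection is kept.
func shouldInvalidate(callerCtx context.Context, err error) bool {
	if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
		return false // other error classes are outside this sketch
	}
	return callerCtx.Err() == nil
}

// demoOutcomes runs the two cases described in the commit message.
func demoOutcomes() (callerTimedOut, staleChannel bool) {
	ended, cancel := context.WithCancel(context.Background())
	cancel() // the caller gave up on its own request
	callerTimedOut = shouldInvalidate(ended, context.Canceled)
	staleChannel = shouldInvalidate(context.Background(), context.DeadlineExceeded)
	return callerTimedOut, staleChannel
}

func main() {
	a, b := demoOutcomes()
	fmt.Println("invalidate on caller timeout:", a)
	fmt.Println("invalidate on stale channel:", b)
}
```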
2026-06-01 15:11:02 -07:00
Chris Lu dfa86b4313 volume: keep volume writable after a deletion-tail compaction (#9776)
makeupDiff replays post-snapshot changes onto the compacted volume. For a
replayed deletion it appended a tombstone to the new .dat but recorded the
.idx entry with offset 0. When that deletion is the last replayed change the
tombstone lands at the .dat tail, and the post-commit integrity check skips
offset-0 entries, so it sees 32 trailing bytes it can't account for and flips
the volume read-only, reloading it as a SortedFileNeedleMap instead of the
writable map.

Record the tombstone's real .dat offset, matching the normal delete path; the
needle map still treats it as deleted because of its negative size, so lookups are
unchanged. Mirror the same fix into the Rust volume server.
2026-06-01 13:15:08 -07:00
Chris Lu 8c60408bfb s3: auto-enforce bucket quota read-only both ways (#9774)
* s3: auto-enforce bucket quota read-only both ways

Quota read-only only ever flipped when an admin re-ran
s3.bucket.quota.enforce, so a bucket that went over quota stayed
read-only forever even after usage dropped back under.

Fold enforcement into the per-minute, leader-locked bucket-size loop
the s3 gateway already runs for metrics: it now flips each bucket's
read-only flag to match its quota in both directions, rewriting
filer.conf only when a flag actually changes. The set/clear decision
lives in one shared FilerConf.ApplyBucketQuotaReadOnly helper so the
shell command and the gateway can't drift.

* only manage read-only when a quota is set, never clobber manual locks

* trim comments
2026-06-01 13:11:18 -07:00
Chris Lu 57797c9b38 filer.sync: repair a destination shorter than the source (#9778)
When the destination's stored mtime is newer than the incoming source version, UpdateEntry skips the update (last-writer-wins). A copy left truncated by an earlier failed replication trips this: the source kept the file's original mtime while the partial copy was written recently, so it looks "newer" and is never corrected. When the destination is strictly shorter than the source, re-replicate the full source content and replace the chunk list instead of skipping. Same shorter-than-source bypass for CreateEntry.
2026-06-01 13:04:23 -07:00
Nguyễn Lộc Phúc ed31271e28 fix(s3api): Fix multipart upload ETag compatibility with Hadoop S3A (#9772)
* s3api: use getEtagFromEntry for multipart part ETag to prefer Extended metadata

* s3api: add tests for getEtagFromEntry Extended ETag preference in multipart upload

* s3api: avoid double-quoting ETags in ListParts output

* s3api: add docstring for filer_multipart_etag_test.go
2026-06-01 13:03:46 -07:00
7y-9 5ea75dcc67 fix(http): handle invalid gzip stream errors (#9767)
* fix(http): handle invalid gzip stream errors

Explain:

- problem: ReadUrlAsStream could panic when a response claimed gzip encoding but the body was not a valid gzip stream.

- root cause: the gzip reader error was ignored and a nil reader was deferred and read from.

- fix: return the gzip.NewReader error before registering Close or reading.

- validation: go test ./weed/util/http -run TestReadUrlAsStreamReturnsGzipReaderError -count=1; git diff --check.

* test: avoid closing shared global HTTP client in unit test
2026-06-01 12:21:19 -07:00
Chris Lu 1a19683ee6 filer: name the read-only path in the write rejection (#9773)
* filer: name the read-only path in the write rejection

The write path rejected creates under a read-only rule with a bare
"read only", giving no hint which path was locked or why. Wrap the
error with the matched location prefix and a quota hint so a FUSE
mkdir or S3 put points straight at the offending bucket.

* return the read-only reason over HTTP and drop any query string from the fallback prefix
2026-06-01 12:20:45 -07:00
Chris Lu 2e3fabbf24 filer.sync: back off on transient upload errors (#9777)
A destination volume server that hits its idle deadline while reading a large upload body under load returns 400 "read tcp ...: i/o timeout". fetchAndWrite retried that on the flat ~1s RetryUntil backoff, hammering the already-overloaded destination. Route i/o timeout, connection reset, broken pipe and net.Error timeouts through the same escalating 10s-2min backoff already used for EOF so it can recover.
2026-06-01 12:18:17 -07:00
Chris Lu f9ee49b03e shell: volume.fsck must not skip the system-log subtree (#9764)
shell: only skip system-log subtree in fs.meta.save, not fsck/verify

The SystemLogDir skip lived in the shared BFS traversal, so volume.fsck
built its in-use set without the /topic/.system/log chunks and flagged
every referenced log needle as an orphan. -reallyDeleteFromVolume would then
delete live log data and leave dangling filer entries. Gate the skip
behind a flag that only fs.meta.save sets.
2026-06-01 09:54:22 -07:00
Chris Lu 80dd3b2621 EC bitrot follow-ups: protect destination sidecar on optional copy; cap sidecar block_size (#9763)
* fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest

A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a
huge power-of-two block_size that passed validation, then force a multi-GiB
scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize
(64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and
derive the volume flag cap from the same constant so they cannot drift.
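
A sketch of the bounded power-of-two check, assuming the rule is "a power of two between 1 MiB and 64 MiB". validBlockSize and blockSizeFromFlagMB are hypothetical names; only the 64 MiB cap and the idea of deriving the flag cap from the same constant come from the text.

```go
package main

import "fmt"

const (
	mib                = int64(1) << 20
	maxBitrotBlockSize = 64 * mib // shared upper bound
)

// validBlockSize accepts powers of two in [1 MiB, 64 MiB]. A power of two that
// is at least 1 MiB is necessarily a multiple of 1 MiB.
func validBlockSize(n int64) bool {
	return n >= mib && n <= maxBitrotBlockSize && n&(n-1) == 0
}

// blockSizeFromFlagMB bounds the MiB value before multiplying, so a huge flag
// value can neither overflow int64 nor slip past the power-of-two check.
func blockSizeFromFlagMB(mb int64) (int64, error) {
	if mb < 1 || mb > maxBitrotBlockSize/mib {
		return 0, fmt.Errorf("block size %d MiB outside [1, %d]", mb, maxBitrotBlockSize/mib)
	}
	size := mb * mib
	if !validBlockSize(size) {
		return 0, fmt.Errorf("block size %d MiB is not a power of two", mb)
	}
	return size, nil
}

func main() {
	for _, mb := range []int64{16, 24, 128} {
		size, err := blockSizeFromFlagMB(mb)
		fmt.Println(mb, size, err)
	}
}
```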

* fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy

writeToFile opened the destination with O_TRUNC before knowing whether the
source had the file, so an optional copy (ignoreSourceFileNotFound) from a source
that lacks the .ecsum truncated and then removed a valid pre-existing destination
sidecar. Stage the optional copy into a temp sibling and commit it with an atomic
rename only when the source actually delivered the file; a missing source is now
a no-op. Mandatory copies keep their in-place behavior.
2026-05-31 23:42:33 -07:00
Chris Lu 9658f309d2 EC bitrot detection: per-shard checksum sidecars (#9761)
* ec: add EC bitrot checksum protobuf

EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.

* ec: bitrot checksum sidecar format, validation, and per-volume load

Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.
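
For orientation, per-block CRC32C checksums can be computed with Go's hash/crc32 package and the Castagnoli polynomial. The sketch below covers only the checksum math; the sidecar layout and self-integrity header are not modeled, and blockChecksums and firstCorruptBlock are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32C table.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// blockChecksums returns one CRC32C per blockSize-byte block of data.
// The final block may be shorter than blockSize.
func blockChecksums(data []byte, blockSize int) []uint32 {
	if blockSize <= 0 {
		return nil
	}
	var sums []uint32
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.Checksum(data[off:end], castagnoli))
	}
	return sums
}

// firstCorruptBlock returns the index of the first block whose checksum
// differs from want (or is missing on either side), or -1 if all match.
func firstCorruptBlock(data []byte, blockSize int, want []uint32) int {
	got := blockChecksums(data, blockSize)
	for i := range got {
		if i >= len(want) || got[i] != want[i] {
			return i
		}
	}
	if len(want) > len(got) {
		return len(got)
	}
	return -1
}

func main() {
	shard := []byte("0123456789")
	sums := blockChecksums(shard, 4)
	shard[5] ^= 0xFF // simulate bitrot in the second block
	fmt.Println(firstCorruptBlock(shard, 4, sums))
}
```

Block-level checksums localize damage to one block, which is what lets a scrub flag a shard without rereading the original volume.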

* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild

WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.

* ec: read-only checksum scrub with Reed-Solomon arbiter

ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.

* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill

Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.

* ec: volume server bitrot config flags

-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).

* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply

Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and a
block size cannot collapse a sidecar to a few oversized blocks.

* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path

The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.

* fix(ec_bitrot): remove orphaned sidecars when the generation is gone

Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.

* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles

ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.

* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles

* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast

The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.

* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB

The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.

* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx

The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.

* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull

Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).

* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32

* refactor(ec): fold the EcFiles WithContext variants into the base functions

RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.

* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles

Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.

* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif

- BackgroundECContext is now a function returning a fresh *ECContext, so callers
  cannot mutate a shared singleton or race on it (and it mirrors context.Background,
  which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
  unreadable .vif fails closed instead of silently rebuilding with the default
  ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
2026-05-31 18:52:44 -07:00
Chris Lu fdfeb4063c shell: warn in volume.list when a volume id spans collections (#9759)
* shell: warn in volume.list when a volume id spans collections

A reused volume id, the result of the master handing out an id already
used by another collection (for example after losing its max-volume-id
counter on restart), makes collection.delete destroy the wrong
collection's data and makes any bare-id lookup, move, or vacuum
ambiguous. volume.list now scans the full topology and warns on ids
present in more than one collection so the clash is visible before any
destructive operation.

* volume.list: track duplicate ids lazily, sort with slices.Sort

Allocate the per-id collection set only on the first cross-collection clash
instead of one set per volume, so allocations scale with duplicates rather
than the volume count.
2026-05-31 11:52:39 -07:00
Chris Lu 35ab67fa8a s3: reject reserved bucket name "filemeta" (#9760)
filemeta is the filer SQL store's default table name. A bucket of that
name passes VerifyS3BucketName but is rejected by the store's isValidBucket
guard on every operation, so the bucket can be created but can't be deleted, and it wedges
fsck. Reject it at creation so both checks agree.
2026-05-31 11:15:05 -07:00
Chris Lu 6b06fe5ec4 s3: commit a versioned PutObject and its latest pointer in one transaction (#9756)
* s3: commit a versioned PutObject and its latest pointer in one transaction

A versioned PutObject wrote the version file and flipped the .versions
latest pointer in two separate routed transactions. Fold the
RECOMPUTE_LATEST into the version file's PUT so both commit atomically
under the object's per-path lock: the recompute, applied after the PUT in
the same transaction, scans the directory and sees the new version. A
crash can no longer leave the version present with a stale pointer.

putToFiler now takes a putFinalize describing the finalize step — routed
mutations folded into the PUT, or an afterCreate run under the object
write lock off the ring. Suspended-versioning keeps its afterCreate-only
form; multipart, copy, and delete-marker finalizes are unchanged.

* s3: trim verbose finalize comments
2026-05-31 00:13:36 -07:00
Chris Lu d806778757 admin: store file browser uploads in volumes, not inline (#9752)
uploadFileGrpc passed SaveSmallInline with a 256 KiB limit, so uploads under
that size were written to entry.Content instead of a volume. The filer's own
upload path never inlines unless saveToFilerLimit is set (default 0), and the
S3 server shares that path. Drop the inline options so admin uploads always
land in volumes.
2026-05-30 23:47:42 -07:00
Chris Lu 186747e7e8 admin: view images and PDFs inline in the file browser (#9751)
The viewer embedded images and PDFs through the download URL, which sent
Content-Disposition: attachment, so the browser downloaded them instead of
rendering. Add an inline mode to the download endpoint, limited to images and
PDFs so a hostile upload (HTML, SVG) can't run as same-origin script, set
X-Content-Type-Options: nosniff, and resolve the MIME the same way the viewer
does. The viewer now requests the inline URL.
2026-05-30 23:46:09 -07:00
Chris Lu 05c6500453 volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755)
* volume: don't drop the last writable slot on auto-sized disks

MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk
with room for exactly one volume (free between 1x and 2x the size limit)
reported 0 slots. The master then never grew a writable volume and every
assign drained its retry budget, so writes failed with context deadline
exceeded. Count the full volumes that actually fit, floored at one for an
auto-sized disk that has free space.
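
The dead zone is easiest to see with numbers. The sketch below is a simplification that looks only at free space and the size limit and ignores volumes already on the disk; slotsBefore and slotsAfter are hypothetical names, not the body of MaybeAdjustVolumeMax.

```go
package main

import "fmt"

const gib = uint64(1) << 30

// slotsBefore models the old rule: whole volumes that fit, minus one.
func slotsBefore(freeBytes, sizeLimit uint64) int {
	if sizeLimit == 0 {
		return 0
	}
	n := int(freeBytes/sizeLimit) - 1
	if n < 0 {
		n = 0
	}
	return n
}

// slotsAfter counts the whole volumes that fit and reports at least one slot
// whenever the disk has any free space.
func slotsAfter(freeBytes, sizeLimit uint64) int {
	if sizeLimit == 0 {
		return 0
	}
	n := int(freeBytes / sizeLimit)
	if n == 0 && freeBytes > 0 {
		n = 1
	}
	return n
}

func main() {
	free, limit := 45*gib, 30*gib // free space between 1x and 2x the limit
	fmt.Println(slotsBefore(free, limit), slotsAfter(free, limit))
}
```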

* mini: show disk and volume capacity in the startup banner

Print free space, volume size, total volume count and free volume count
under the data directory line, so a volume size limit that outstrips the
disk is visible at startup instead of surfacing later as failed writes.
2026-05-30 23:45:17 -07:00
Chris Lu a10607f90a Add Terraform support for VM-based SeaweedFS deployment (#9754)
* terraform: add cloud-agnostic core renderer module

Renders per-node weed argv, systemd units, config files, disk-mount and secret-fetch scripts, and cloud-init from an address map. Creates zero cloud resources. Flags verified against the weed binary: volume uses -mserver for the master list, gRPC is -port.grpc (auto http+10000), minFreeSpacePercent is a string, filer store via -defaultStoreDir.

* terraform: add mTLS and JWT security module

Generates the CA, per-component certs with distinct CNs, and JWT signing keys via the tls/random providers. Emits a core_security object plus PEMs for secret-store delivery.

* terraform: add AWS deployment module and examples

Reserves stable ENIs first, renders config via the core, then creates instances, prevent_destroy EBS data disks mounted at /data, and the cluster security group. With enable_security, generates certs/JWT, stores them in SSM SecureString, grants an instance role, and fetches them at boot so secrets stay out of user_data. Keyed for_each on every stateful tier.

* terraform: add local cluster test harnesses

run_local_cluster.sh and run_local_secure.sh render a cluster with the core and run real weed processes, asserting master quorum, volume registration, filer/s3 round-trips, mutual-TLS formation, and JWT enforcement. Use an isolated high port range with a guard so they never touch a cluster already running on the machine. The weed binary defaults to $(go env GOPATH)/bin/weed.

* terraform: add CI workflow and README

fmt/validate/tofu-test plus smoke jobs that build weed and run both harnesses.

* terraform: guard against empty filesystem UUID in mount script

An empty UUID made grep -q match any fstab line, skipping the fstab entry and breaking the mount. Fail fast when blkid returns no UUID.

* terraform: sanitize cluster name in WEED_CLUSTER env keys

Hyphens or spaces in cluster_name produced invalid systemd/bash env var names; map non-alphanumerics to underscores.

* terraform: omit empty jwt.signing block from security.toml

With enable_security and no JWT key, the template emitted [jwt.signing] key="". Gate the block on a non-empty key and cover it with a test.

* terraform: mark core security input as sensitive

The security object carries JWT signing keys; keep them out of plan output and known values.

* terraform: enforce jwt_length minimum of 32

* terraform: note region/AZ coupling in HA example

* terraform: guard WORKDIR before recursive delete in test harnesses

* terraform: fix README fence language and test count

* terraform: handle embedded s3 with no filer nodes

Indexing sort(keys(var.filers))[0] errored at plan time when embedded S3 was enabled but no filers were defined; fall back to an empty config source.

* terraform: scope kms:Decrypt to a configurable key arn

Replace the hardcoded Resource="*" with a kms_key_arn variable (default "*") so production can restrict decrypt to a specific CMK.

* terraform: encrypt EBS data volumes at rest

Set encrypted = true on the volume/filer data disks and the all-in-one example disk.

* terraform: protect filer instances from API termination

Filers hold the leveldb2 metadata store, so they are stateful and get the same disable_api_termination as masters and volumes.

* terraform: stop instance before detaching in all-in-one example

* terraform: drop stale references to the removed plan doc

* terraform: correct stale mount-step comment in aws module

* terraform: mark Terraform support as experimental in README
2026-05-30 23:43:17 -07:00
Chris Lu 0e35235908 s3: return NoSuchVersion (not NoSuchKey) for a missing versionId (#9749)
GET/HEAD object with an explicit versionId that does not exist returned
NoSuchKey. AWS S3 returns NoSuchVersion (404) for this case; tools that
distinguish "key gone" from "this version gone" rely on that code.

Add the ErrNoSuchVersion error code and use it on the GET and HEAD
specific-version lookups. Only a genuine not-found maps to NoSuchVersion;
a transient or internal filer error now maps to InternalError (500)
instead of a misleading 404. getSpecificObjectVersion wraps its lookup
error with %w so callers can detect filer_pb.ErrNotFound.
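
A sketch of the mapping, assuming a sentinel error and string error codes. errNotFound stands in for filer_pb.ErrNotFound, and lookupVersion and s3ErrorCode are hypothetical names; the point is that wrapping with %w keeps the sentinel detectable through errors.Is.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound = errors.New("no entry is found in filer store") // stand-in sentinel
	errTimeout  = errors.New("filer timeout")                    // stand-in transient error
)

// lookupVersion wraps the lookup error with %w so the cause stays inspectable.
func lookupVersion(versionID string, lookup func(string) error) error {
	if err := lookup(versionID); err != nil {
		return fmt.Errorf("get version %s: %w", versionID, err)
	}
	return nil
}

// s3ErrorCode maps only a genuine not-found to NoSuchVersion; anything else
// becomes InternalError instead of a misleading 404.
func s3ErrorCode(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, errNotFound):
		return "NoSuchVersion"
	default:
		return "InternalError"
	}
}

func main() {
	missing := lookupVersion("v1", func(string) error { return errNotFound })
	flaky := lookupVersion("v1", func(string) error { return errTimeout })
	fmt.Println(s3ErrorCode(missing), s3ErrorCode(flaky))
}
```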
2026-05-30 21:09:53 -07:00
Chris Lu 7c5ca01027 admin: export file/folder metadata from the file browser (#9750)
Add a per-row Export button (files and folders) that downloads the filer
metadata in the length-prefixed FullEntry protobuf format that weed shell
fs.meta.load reads, gzipped as <name>.meta.gz like fs.meta.save. Folders are
walked recursively via the filer BFS metadata stream, excluding the system
log subtree. Streamed over gRPC so it keeps working with the filer HTTP
listener disabled.
2026-05-30 20:59:01 -07:00
Chris Lu 3441a2a7f1 s3: short-circuit filer failover on ErrNotFound (#9748)
withFilerClientFailover treated a filer's ErrNotFound like a transport
failure: it kept the result, re-queried every other filer, and finally
wrapped the answer as "all filers failed, last error: ... no entry is
found in filer store".

For workloads with many legitimate misses (e.g. GET object?versionId=X
for a version that was deleted or expired), this turned each 404 into N
filer round-trips and produced a misleading error string.

A reachable filer that answers ErrNotFound has given an authoritative
answer; failover exists to route around unreachable or unhealthy filers,
not to look harder for an entry the store reports as absent. Return
ErrNotFound directly instead of fanning out. Callers that need
read-after-write retries already handle that at the S3 semantic layer
(e.g. getLatestObjectVersion).
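
The short-circuit can be sketched as a failover loop that treats not-found as an answer, not a failure. This is illustrative only: withFailover, errNotFound, and errUnreachable are hypothetical names, and the filers are plain strings.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound    = errors.New("no entry is found in filer store") // stand-in sentinel
	errUnreachable = errors.New("filer unreachable")                // stand-in transport error
)

// withFailover tries each filer in turn. A not-found from a reachable filer is
// authoritative and is returned at once; only other errors move on to the
// next filer.
func withFailover(filers []string, fn func(filer string) error) error {
	if len(filers) == 0 {
		return errors.New("no filers configured")
	}
	var lastErr error
	for _, f := range filers {
		err := fn(f)
		if err == nil {
			return nil
		}
		if errors.Is(err, errNotFound) {
			return err
		}
		lastErr = err
	}
	return fmt.Errorf("all filers failed, last error: %w", lastErr)
}

func main() {
	calls := 0
	err := withFailover([]string{"f1", "f2", "f3"}, func(string) error {
		calls++
		return errNotFound
	})
	fmt.Println(calls, err)
}
```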
2026-05-30 15:07:27 -07:00
Chris Lu 34be9170f0 4.30 4.30 2026-05-30 10:52:32 -07:00
Elias Paitz 30f49013e1 perf(s3.iam.GetUser): Make the API default to the request username if not specified (#9746)
* perf(s3.iam.GetUser): Make the API default to the request username if not
specified

This makes the Embedded S3 IAM API align with the documented behavior of the AWS IAM
API as per AWS Docs: https://docs.aws.amazon.com/IAM/latest/APIReference/API_GetUser.html

BREAKING CHANGE: This changes the default behavior of the Embedded IAM API to use the
username of the user holding the accesskey used to make the request in
the GetUser request handler.

* test: cover GetUser implicit username default

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-30 10:51:03 -07:00
Chris Lu 4bf27278fa topology: fail replica writes fast when a replica is unreachable (#9744)
* operation: bound upload retries and honor context cancellation

retriedUploadData hardcoded 3 attempts and an uninterruptible backoff
sleep. A synchronous replica write to a dead host therefore paid the
full dial timeout three times over before failing.

Add UploadOption.MaxAttempts (<=0 keeps the default of 3) so callers can
cap attempts, and make the loop return as soon as the context is
cancelled so an abandoned upload unwinds instead of retrying.
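
The retry behavior described above could look roughly like the following. This is a sketch under assumptions: `retry`, `countAttempts`, and `errDial` are hypothetical names, and the real `retriedUploadData` and `UploadOption` are not reproduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const defaultAttempts = 3

var errDial = errors.New("dial timeout") // stand-in for a dead host

// retry runs op up to maxAttempts times (<=0 keeps the default of 3).
// The backoff wait is interruptible: a cancelled ctx returns immediately.
func retry(ctx context.Context, maxAttempts int, backoff time.Duration, op func() error) error {
	if maxAttempts <= 0 {
		maxAttempts = defaultAttempts
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		timer := time.NewTimer(backoff)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
	}
	return err
}

// countAttempts reports how many times an always-failing op is invoked.
func countAttempts(maxAttempts int) int {
	calls := 0
	_ = retry(context.Background(), maxAttempts, 0, func() error {
		calls++
		return errDial
	})
	return calls
}

func main() {
	fmt.Println(countAttempts(0), countAttempts(1))

	// A cancelled context unwinds during the backoff instead of retrying.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(retry(ctx, 0, time.Hour, func() error { return errDial }))
}
```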

* topology: fail replica writes fast when a replica is unreachable

DistributedOperation already returns on the first error, but a single
dead replica is itself the slow result: its goroutine retries the upload
three times through the dial timeout (~30s) before any error surfaces,
stalling the originating client write the whole time.

Make the replica write a single attempt (MaxAttempts=1) so a dead
replica fails after one dial timeout instead of three, and thread a
context into DistributedOperation that is cancelled once the outcome is
decided, so a healthy replica is no longer held hostage by one that is
stalled in a dial. The originating client write is what retries.

* topology: keep replica deletes off the client request context

ReplicatedDelete runs after the local needle is already deleted. Driving
the replica deletes off r.Context() means a client disconnect cancels
them and orphans needles on the replicas, so use a background context.

* operation, topology: trim comments on the replica fail-fast path
2026-05-30 10:45:02 -07:00
Chris Lu 5834c834e3 Refine enterprise edition feature blurb in version output and docs 2026-05-30 09:29:06 -07:00
Rushikesh Deshpande ea33b851e6 fix: return immediately on first error in DistributedOperation (#9740)
* fix: return immediately on first error in DistributedOperation

* simplify DistributedOperation fail-fast to a single buffered channel

Drop the separate errCh: the collector now fails fast on the first error
it reads off the buffered resultCh and returns ret.Error(), so the early
return carries the same [host]: err annotation as the aggregated path and
there is no select race between two channels.
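
For illustration, a fail-fast collector over a single buffered channel might be shaped like this. The names `distributed`, `result`, and `errBoom` are hypothetical, and the real DistributedOperation aggregates results differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errBoom = errors.New("boom") // stand-in for a replica failure

type result struct {
	host string
	err  error
}

// distributed runs op against every host concurrently and returns on the
// first error it reads, annotated with the failing host.
func distributed(hosts []string, op func(host string) error) error {
	// Buffered to len(hosts) so goroutines that finish after an early
	// return can still send and exit instead of blocking forever.
	results := make(chan result, len(hosts))
	for _, h := range hosts {
		go func(h string) { results <- result{h, op(h)} }(h)
	}
	for range hosts {
		if r := <-results; r.err != nil {
			return fmt.Errorf("[%s]: %w", r.host, r.err)
		}
	}
	return nil
}

func main() {
	err := distributed([]string{"slow", "dead"}, func(h string) error {
		if h == "dead" {
			return errBoom
		}
		time.Sleep(2 * time.Second) // a healthy but slow replica
		return nil
	})
	fmt.Println(err) // returns without waiting for the slow host
}
```

Because every result, good or bad, travels over the same channel, there is only one receive to reason about and no ordering question between a result channel and a separate error channel.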

---------

Co-authored-by: Ubuntu User <ubuntu@example.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-30 00:14:44 -07:00
Chris Lu e60d02c339 fix(topology): recover heartbeat-fulled volumes once they shrink (#9742)
A volume removed from writables by the write-assign path stamps fullSince
in RecordAssign, which UpdateVolumeSize's recovery branch needs to re-add
it once it decays back under the limit. A volume removed by the heartbeat
capacity path (SetVolumeCapacityFull) never stamped it, so after the
reported size dropped — vacuum, TTL expiry, bulk deletes — the volume
stayed out of writables forever, even though every heartbeat carried the
smaller size.

Stamp fullSince when the capacity path actually removes a volume from
writables, so the existing recovery branch fires. Gating on the removal
keeps it paired with the caller's activeVolumeCount decrement, matching
RecordAssign. Oversized volumes still stay out, as before.
2026-05-30 00:06:10 -07:00
Jaehoon Kim 4b23204023 fix(vacuum): writable volume re-notification after worker VACUUM (#9732)
* fix(vacuum): notify master writable after worker vacuum commit

Add Phase 3 (markWritableOne) that walks vacuumTargets and calls
VolumeMarkWritable on each replica's volume server, mirroring
batchVacuumVolumeCommit's per-replica SetVolumeAvailable. Failures are
logged at WARN; the task does not fail because the vacuum itself
already succeeded. See upstream seaweedfs#9685.

* fix(vacuum): delay Phase 3 to let post-commit heartbeats settle

Phase 3's VolumeMarkWritable can race with the volume server's first
post-commit heartbeat. SetVolumeWritable adds the vid to writables,
but a racing heartbeat whose ReadOnly value changed re-runs
EnsureCorrectWritables against the master's per-replica cache, and any
replica still cached as ReadOnly=true silently removes the vid again
— with no further heartbeat change to trigger another recovery.

Sleep 30s after Phase 2 (Commit) so every replica's post-vacuum
heartbeat has reached the master before Phase 3 fires. Cancel cleanly
on ctx.Done so a shutdown during the wait still exits.

* fix(vacuum): reduce post-commit settle from 30s to 10s

VolumePulsePeriod is 5s, so 10s (2x) is enough margin for every
replica's post-commit heartbeat to reach the master before Phase 3
fires. 30s was overly conservative and made TestVacuumExecutionIntegration
hit its 30s context deadline.

* fix(vacuum): use flat 1m timeout for VolumeMarkWritable RPC

VolumeMarkWritable on the volume server is a metadata operation
(reopen idx + flags + master ReadOnly=false heartbeat), independent
of volume size. Scaling via vacuumTimeout(time.Minute) gave it tens
of minutes — even hours on TB volumes — so a single unresponsive
replica could block Phase 3 indefinitely. Use a flat 1m cap.

* fix(vacuum): gate post-vacuum mark-writable on commit read-only state

Phase 3 force-called VolumeMarkWritable on every replica unconditionally,
clearing the read-only flag and persisting ReadOnly=false even for a
replica left read-only by an operator, an EIO quarantine, or low disk.
That overrode states the master deliberately keeps out of writables;
master built-in vacuum gates the same step on the commit's IsReadOnly via
SetVolumeAvailable.

Capture the VacuumVolumeCommit response and skip Phase 3 when any replica
came back read-only, letting it recover on its own ReadOnly=false
heartbeat. Drop the 10s post-commit settle sleep: the heartbeat race it
guarded needed a replica cached read-only at the master, which the gate
now excludes.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-29 23:43:24 -07:00
Mohamed Chorfa e5fb547e95 wdclient, dailyrun: add equal jitter to retry backoff (#9737)
* wdclient, dailyrun: add equal jitter to retry backoff

Prevents thundering-herd retries when many clients recover from a
transient failure at the same instant (e.g., filer restart, network
partition healing).

Uses equal jitter: wait in [d/2, d) instead of deterministic d.
This bounds the maximum wait while still desynchronizing clients.
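
As a rough sketch of equal jitter (the helper name `equalJitter` is an assumption, not necessarily what the PR uses), the wait keeps half of the backoff fixed and randomizes the other half:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// equalJitter returns a wait in [d/2, d). Zero or negative inputs yield 0.
func equalJitter(d time.Duration) time.Duration {
	if d <= 0 {
		return 0
	}
	half := d / 2
	// d-half is at least 1 for any d >= 1, so Int63n never sees n <= 0.
	return half + time.Duration(rand.Int63n(int64(d-half)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(equalJitter(2 * time.Second))
	}
}
```

Compared with full jitter (a wait in [0, d)), this keeps a floor of d/2 so retries never fire almost immediately, while still spreading clients over half the interval.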

Files:
- weed/wdclient/filer_client.go   (LookupVolumeIds retry loop)
- weed/s3api/s3lifecycle/dailyrun/dispatch.go (dispatchWithRetry)

Tests added for bounds, zero/negative inputs, and distribution sanity.

Closes #9735

* wdclient: honor ctx cancellation during LookupVolumeIds backoff

---------

Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-29 20:54:54 -07:00
Mohamed Chorfa 10c4ab3e33 s3, iam, volume, filer, master: add /healthz and /readyz health probes (#9738)
Adds standard Kubernetes liveness/readiness endpoints to all HTTP
servers that were missing them:

- S3:     adds /readyz (already had /healthz)
- IAM:    adds /healthz and /readyz (had none)
- Volume: adds /readyz (already had /healthz)
- Filer:  adds /readyz on default and readonly mux
- Master: adds /healthz and /readyz at root level
  (preserves existing /cluster/healthz)

All endpoints reuse existing health handlers or return 200 OK as a
minimal foundation. Future PRs can enhance /readyz with dependency
checks without breaking the contract.
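
A minimal version of such probes, using only the standard library, could look like the sketch below. `newMux` and `probe` are hypothetical helpers for illustration; the actual servers reuse their existing health handlers where they have them.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux registers liveness and readiness probes that return 200 OK.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	ok := func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}
	mux.HandleFunc("/healthz", ok)
	// /readyz starts identical to /healthz; dependency checks can be
	// added later without changing the path or the 200-means-ready contract.
	mux.HandleFunc("/readyz", ok)
	return mux
}

// probe issues an in-memory GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/readyz"))
}
```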

Closes #9736

Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
2026-05-29 20:45:03 -07:00
Chris Lu 4c5d1d53b4 Update README.md 2026-05-29 14:34:24 -07:00
Chris Lu ba9e74d8a7 docs: add zyner as a gold sponsor 2026-05-29 12:40:58 -07:00
7y-9 fbcba51e73 refactor: avoid unused sql insert result (#9734) 2026-05-29 00:45:45 -07:00
Chris Lu c9623007a2 fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733)
On a read-only watched path the idle heartbeat keeps sync_offset fresh,
but a busy source filer still emits a MaxUnsyncedEvents marker after many
filtered events. The marker has a non-nil but empty EventNotification, so
the client routed it to the event path, where it advanced no real
watermark yet drove offsetFunc to republish the stale processed
watermark — regressing the gauge between heartbeats and spiking the
derived lag every time a filtered-event burst landed.

Route the empty marker through OnIdleHeartbeat like the idle heartbeat so
its fresh timestamp keeps the gauge current; it still advances the
in-stream resume cursor.
2026-05-28 23:29:59 -07:00
Chris Lu 5955972fe6 fix(shell): verify volume.merge output before overwriting replicas (#9731)
* fix(shell): verify volume.merge output before overwriting replicas

volume.merge overwrote every replica with the merged copy without checking it was complete. Read back the merged copy and refuse to overwrite unless it holds at least as many live needles as the most complete source replica, leaving the originals intact on a short or empty merge.
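
The guard amounts to a comparison of live needle counts. A hypothetical sketch (the function name and the integer inputs are assumptions; the real command reads the counts back from the volume servers):

```go
package main

import "fmt"

// safeToOverwrite reports whether the merged copy holds at least as many
// live needles as the most complete source replica.
func safeToOverwrite(mergedLive int, sourceLive []int) bool {
	most := 0
	for _, n := range sourceLive {
		if n > most {
			most = n
		}
	}
	return mergedLive >= most
}

func main() {
	fmt.Println(safeToOverwrite(120, []int{100, 120, 90})) // complete merge
	fmt.Println(safeToOverwrite(0, []int{100, 120, 90}))   // empty merge: keep originals
}
```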

* fix(shell): keep merged volume until all replicas are rebuilt

On a copy failure partway through the overwrite loop, the temporary merged copy was deleted along with the half-rebuilt replicas. Stop deleting it until every replica has been rebuilt; on failure the verified copy is kept so the merge can be re-run to completion.

* refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly

* fix(shell): guard against nil volume status response
2026-05-28 19:29:25 -07:00
Chris Lu 16717b0bf4 fix(s3): authenticate JWT unsigned-streaming uploads (#9729)
A bearer-token client whose SDK appends a CRC32 trailer sends an
unsigned-streaming PUT (STREAMING-UNSIGNED-PAYLOAD-TRAILER) with no SigV4
signature, so getRequestAuthType classifies it as authTypeStreamingUnsigned.
The auth dispatch ignored the bearer token and fell back to anonymous, and
newChunkedReader tried to verify the bearer token as a SigV4 seed signature
and failed, so the body could not be decoded either.

Dispatch the streaming-unsigned auth on whatever credential is present
(SigV4 / JWT / anonymous), and skip the SigV4 seed-signature recompute for
JWT requests in the chunked reader.
2026-05-28 18:10:24 -07:00
Chris Lu 2f0643e5b1 fix(volume): stop flipping volumes read-only on a non-append-ordered .idx (#9726)
* fix(volume): verify the .dat-tail needle in the integrity check

CheckVolumeDataIntegrity checked the last entry by file position in the .idx
and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset.
That entry is the .dat tail only when the .idx is in append order; a key-sorted
.idx (weed fix and other rebuilds listed entries by key) puts the highest-key
needle last, whose tail sits mid-file, so healthy volumes went read-only on every
load and re-running weed fix only reproduced the sorted index.

Locate the needle at the maximum offset — the one physically last in the .dat —
and verify the .dat ends exactly at it, regardless of .idx ordering. The
append-ordered common case stays O(1) (the last entry's on-disk end matches the
.dat size); only a key-sorted index pays a single linear scan. Deletion
tombstones at the tail are now verified too, instead of skipping the file-size
check.
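
The tail check can be sketched as follows. This is a simplification under stated assumptions: `entry` and `tailMatches` are hypothetical, `size` stands for the needle's full on-disk length, and real needles carry headers and padding that are ignored here.

```go
package main

import "fmt"

// entry is a simplified .idx record: where a needle starts in the .dat
// and how many bytes it occupies on disk.
type entry struct {
	key    uint64
	offset int64
	size   int64
}

// tailMatches reports whether the .dat ends exactly at the needle that is
// physically last, regardless of the order of entries in the index.
func tailMatches(idx []entry, datSize int64) bool {
	if len(idx) == 0 {
		return true // empty index: out of scope for this sketch
	}
	// Fast path: in an append-ordered index the last entry is the tail.
	last := idx[len(idx)-1]
	if last.offset+last.size == datSize {
		return true
	}
	// Slow path: a key-sorted index needs one linear scan for max offset.
	tail := idx[0]
	for _, e := range idx[1:] {
		if e.offset > tail.offset {
			tail = e
		}
	}
	return tail.offset+tail.size == datSize
}

func main() {
	keySorted := []entry{{1, 10, 10}, {2, 0, 10}} // highest key sits mid-file
	fmt.Println(tailMatches(keySorted, 20), tailMatches(keySorted, 25))
}
```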

* fix(command): weed fix rebuilds the .idx in .dat offset order

SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape
— so the rebuilt .idx put the highest-key needle last instead of the .dat-tail
needle, and dropped tombstones whose live needle was gone. Collect the live and
deleted entries, sort by .dat offset, and write them in append order so the .idx
stays a faithful log whose last entry is the real .dat tail.
2026-05-28 18:04:31 -07:00
Chris Lu 685571d93f fix(s3): allow anonymous unsigned-streaming PutObject (#9727)
Modern botocore attaches a CRC32 trailer to plain PutObject, turning the
payload into STREAMING-UNSIGNED-PAYLOAD-TRAILER. An anonymous upload then
carries that header but no Authorization, so it was classified as
authTypeStreamingUnsigned and sent straight to SigV4 verification, which
rejected it as AccessDenied while explicit credentials kept working.

Fall back to the anonymous identity when an unsigned-streaming request
carries no signature, mirroring the plain anonymous path. The request
stays classified as unsigned-streaming so the chunked body is still
decoded.
2026-05-28 17:00:41 -07:00
Chris Lu f5b833ab6a test(ec): end-to-end encode over a multi-server multi-disk stuck layout (#9728)
* test(framework): support multiple disks per server in MultiVolumeCluster

StartMultiVolumeClusterWithDisks gives each volume server N data
directories (one DiskLocation each), passed to -dir as a comma list, with
a per-server disk-dir accessor for file inspection. StartMultiVolumeCluster
keeps its one-disk default.

* test(ec): end-to-end encode over a multi-server multi-disk stuck layout

A volume in the stuck state — real .dat source, a 0-byte stub replica, and
partial stale EC shards from an interrupted encode — must converge to one
valid EC layout. Asserts the full shard set across servers, .ecx/.vif kept
per server (info file survives the source-volume delete), stale shards
cleared, and no regular .dat/.idx left behind.
2026-05-28 16:44:42 -07:00
Chris Lu 3674f9d04d fix(storage): keep EC .vif when deleting a coexisting regular volume (#9723)
* fix(storage): keep EC .vif when deleting a coexisting regular volume

A regular volume and an EC volume for the same id share <base>.vif. When
EC shards are distributed onto a server that still holds the regular
volume — the encode source, or any replica the planner targets — the
post-encode VolumeDelete ran removeVolumeFiles and stripped the shared
.vif, leaving the freshly built EC volume without its info file.

Skip the .vif in removeVolumeFiles when an EC volume for the same id
exists on the disk (mounted, or a sealed .ecx on disk). The regular
volume's .dat/.idx still go; the EC sidecars survive.
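
In outline, the guard only changes which files are selected for removal. The helper below is hypothetical (`filesToRemove` and its `hasEC` flag are illustrative; the real removeVolumeFiles detects the EC volume itself):

```go
package main

import "fmt"

// filesToRemove lists the regular-volume files to delete for a base name.
// The shared .vif is kept when an EC volume for the same id coexists.
func filesToRemove(base string, hasEC bool) []string {
	files := []string{base + ".dat", base + ".idx"}
	if !hasEC {
		files = append(files, base+".vif")
	}
	return files
}

func main() {
	fmt.Println(filesToRemove("7", true))  // EC present: .vif survives
	fmt.Println(filesToRemove("7", false)) // no EC: full cleanup
}
```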

A two-server end-to-end test encodes a volume whose source and a stub
replica both also receive shards, and asserts the final on-disk layout:
both .dat/.idx gone, each server holding only its assigned shards plus
.ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the
Rust seaweed-volume port carries the same guard and tests.

* test(storage): assert .idx is removed in the no-EC destroy case

Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular
volume cleanup (.dat, .idx, .vif) when no EC volume coexists.
2026-05-28 15:39:31 -07:00
Chris Lu dfd05d14cb refactor(filer): remove the inode->path index and the NFS gateway (#9724)
* fix(filer): derive inodes by hash instead of a snowflake sequencer

Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.
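
To illustrate the idea of hash-derived inodes, here is a sketch that uses FNV-1a as the hash. That choice, the byte layout, and the name `inodeFor` are assumptions made for the example; the commit does not state which hash the filer and the FUSE mount actually share.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// inodeFor derives a deterministic inode. Hard links hash their shared
// HardLinkId so every link resolves to one inode; other entries hash
// path + crtime. FNV-1a is an illustrative choice only.
func inodeFor(path string, crtime int64, hardLinkID []byte) uint64 {
	h := fnv.New64a()
	if len(hardLinkID) > 0 {
		h.Write(hardLinkID)
		return h.Sum64()
	}
	h.Write([]byte(path))
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(crtime))
	h.Write(buf[:])
	return h.Sum64()
}

func main() {
	link := []byte{0x01, 0x02}
	fmt.Println(inodeFor("/a/x", 100, link) == inodeFor("/b/y", 200, link))
	fmt.Println(inodeFor("/a/x", 100, nil) == inodeFor("/a/x", 100, nil))
}
```

Since the inode is a pure function of data every filer already holds, no sequencer state or per-filer id has to be coordinated.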

* chore: remove the experimental NFS gateway

The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.

* refactor(filer): drop the inode->path index

With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
2026-05-28 15:00:18 -07:00
Konstantin Lebedev 3537312045 [docker] add make test_keycloak_s3 for local develop and debug (#9719)
* add make test_keycloak_s3 for local develop and debug

* fix typos

* add condition oidc:azp

* docker: reuse test/s3/iam realm and iam config for keycloak dev compose

Point the keycloak dev compose at the existing test/s3/iam configs instead
of a parallel realm/port/key/role set. Adds one declarative realm import
(seaweedfs-test-realm.json) as the single realm source and drops the
duplicated iam.json/s3.json.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-28 13:39:32 -07:00