* fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM
The Rust volume server buffered the entire upload body with
to_bytes(usize::MAX) and only checked the file-size limit afterward, so a
single large upload — or many concurrent uploads, since the in-flight byte
throttle defaults to 0 (unlimited) — could exhaust memory and get the process
OOM-killed under load. The read path had two more single-request OOM vectors:
`vec![0u8; manifest.size]` allocated from an attacker-controlled chunk-manifest
size, and gzip decompression was unbounded (gzip bomb).
- Bound the upload body read by file_size_limit_bytes (plus a margin for
multipart framing), mirroring Go's io.LimitReader(sizeLimit+1), and reject
oversize before the whole body is buffered.
- Validate manifest.size (reject negative / oversized) before allocating.
- Cap gzip output in maybe_decompress_gzip and route the inline GzDecoder sites
through it.
* fix(seaweed-volume): address review - chunk offset, 32-bit cast, decompress errors
- Validate chunk.offset before indexing in chunk-manifest expansion: a negative
offset wrapped to a huge usize and underflowed `end - offset` (panic from a
crafted manifest). Reject negative, skip out-of-range, use saturating math.
- Use usize::try_from for the upload body limit instead of `as usize`, so a
>usize::MAX file_size_limit on 32-bit caps at usize::MAX rather than silently
truncating to a tiny value.
- maybe_decompress_gzip now returns Result<_, GunzipError> distinguishing a
decode failure (callers fall back to raw bytes, as before) from hitting the
size cap (TooLarge), which now returns 413 instead of silently serving the
still-compressed bytes.
* fix(seaweed-volume): inflate manifest chunks into the result window to cap peak memory
The chunk-manifest expansion still doubled memory: `result` was already allocated
at manifest.size (<=2 GiB) and each compressed chunk was inflated into a separate
Vec (also up to 2 GiB), so a single request could peak near 4 GiB.
Decompress compressed chunks directly into their result[offset..] window (bounded
by the remaining space) so a chunk never allocates a second large buffer; peak
stays at ~manifest.size. Bytes past the window are dropped (matching the prior
truncation), and a fully-undecodable chunk still falls back to its raw bytes.
* fix(seaweed-volume): fall back to raw chunk bytes on any decode failure
Per review: the gzip fallback must run on any decode error, not only when no
bytes were decoded. Clear the partially-written output and copy the chunk's raw
bytes (truncated to the window), restoring the prior decode-failure behavior.
* docker: cross-compile the Go binary instead of emulating it under QEMU
The builder stage ran as the target platform, so arm64/arm/386 images
emulated the whole Go compile (and the full git clone) under QEMU. The
binary is CGO-free, so pin the builder to $BUILDPLATFORM and cross-compile
with GOOS/GOARCH (GOARM for v7), keeping every target's compile native.
* ci: build all release container variants in parallel
The build matrix was throttled to two variants at a time because of an outdated
rate-limit concern. Pulls go through mirror.gcr.io and pushes target GHCR only, so the
five variants can all build at once.
* ci: copy each variant to Docker Hub from its build job
The separate copy-to-dockerhub job waited on the whole build matrix before
any GHCR -> Docker Hub copy could start. Move the crane copy into the build
job so each variant copies as soon as it is built, overlapping with the
others still compiling. tag-latest and helm-release now depend on build.
GetObject on a versioned object returned NoSuchKey forever when the
.versions directory existed but carried no latest-version pointer (empty
Extended metadata) while real version files remained inside it. The
self-heal path only fired for a dangling pointer (present but referencing
a missing file), not an absent one, so doGetLatestObjectVersion fell
straight through and errored on every read.
- doGetLatestObjectVersion now calls recoverLatestVersionWithoutPointer
when the pointer is missing or empty. An absent pointer is the legitimate
signal that a pre-versioning or suspended-versioning "null" object is
current, so that object wins; only when it is absent do we rescan
.versions/ and rebuild the pointer from the version files present.
Transient rescan failures propagate instead of being masked as NotFound.
- selectLatestVersion derives the version id from the v_<versionId> file
name when the Seaweed-X-Amz-Version-Id attribute is absent, so version
files written outside the normal versioned-PUT path (replicated or
restored entries) are still promotable. The orphan diagnostic uses the
same detection so an entry can't be both promoted and counted as an orphan.
* fix(ec): pass per-volume data-shard count to the parity-shard split
ShardsInfo.DeleteParityShards/MinusParityShards looped ids 10..13, assuming
the fixed 10+4 layout. For a non-default ratio this splits data and parity
incorrectly: a wide ratio (12+4, 16+6) drops real data ids >= 10, which breaks
ec.decode. They now take a dataShards argument (<= 0 falls back to
DataShardsCount) and clear ids dataShards..MaxShardCount. ec.decode threads
the data-shard count from collectEcNodeShardsInfo to both split call sites,
and admin LogicalSize passes DataShardsCount.
Also: EC cleanup now sets an explicit per-disk storage impact
(-len(ShardIds)) instead of falling back to the TotalShardsCount constant,
so freed-capacity accounting matches the shards actually removed.
OSS is always 10+4, so behavior is unchanged here; this keeps the split
ratio-correct and the API aligned with the enterprise per-volume override.
Adds parity-split ratio tests.
* ec: clear parity shards in one locked pass
Address review: DeleteParityShards looped si.Delete, taking the lock once per
id. shards is sorted by Id and shardBits is a bitmap, so mask off the high
bits and truncate the sorted slice at the first parity id (binary search) under
a single lock. Preserves the dataShards<=0 -> DataShardsCount default.
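A rough sketch of the single-pass approach, under stated assumptions: `shard`, `shardSet`, `dropParity`, and `newFull` are hypothetical stand-ins for the real ShardsInfo types, the bitmap is assumed to be a uint32, and the default of 10 data shards is taken from the 10+4 layout described above.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

const defaultDataShards = 10 // assumed value of DataShardsCount (10+4 layout)

// shard and shardSet are hypothetical stand-ins for the real types.
type shard struct{ ID uint32 }

type shardSet struct {
	mu     sync.Mutex
	bits   uint32  // bit i set means shard id i is present
	shards []shard // kept sorted by ID
}

// newFull builds a set holding shard ids 0..n-1.
func newFull(n int) *shardSet {
	s := &shardSet{}
	for i := 0; i < n; i++ {
		s.bits |= 1 << uint(i)
		s.shards = append(s.shards, shard{ID: uint32(i)})
	}
	return s
}

// dropParity removes every id >= dataShards under one lock acquisition.
func (s *shardSet) dropParity(dataShards int) {
	if dataShards <= 0 {
		dataShards = defaultDataShards
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	// Mask off the high (parity) bits of the bitmap.
	s.bits &= (uint32(1) << uint(dataShards)) - 1
	// Binary search for the first parity id, then truncate the sorted slice.
	n := sort.Search(len(s.shards), func(i int) bool {
		return s.shards[i].ID >= uint32(dataShards)
	})
	s.shards = s.shards[:n]
}

func main() {
	s := newFull(16) // e.g. a 12+4 volume
	s.dropParity(12)
	fmt.Println(len(s.shards), s.bits == 0xFFF)
}
```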
* fix(ec): resolve EC data-shard count from the volume's .vif on reboot
A volume server never loads a cluster EC config into memory, so startup
decisions that assumed 10 data shards mishandled volumes whose .vif
records a different ratio:
- validateEcVolume sized the expected shard against 10 data shards and
required >=10 local shards, so a volume with a non-default ratio and a
coexisting .dat could be wiped on reboot. Read the ratio from the .vif.
- pruneIncompleteEcWithSiblingDat used the hardcoded 10-shard threshold,
so a full data set for a non-default ratio with a healthy sibling .dat
was wiped as a partial leftover. Use the EcVolume's .vif-derived ratio.
Behavior is unchanged for the standard 10+4 layout (the .vif resolves to
10). Adds storage-level reboot tests.
* ec: avoid per-call allocations in ecDataShardsFromVif
Address review: the helper runs once per EC volume at startup. Replace the
slice+map dedup of the two dirs with direct conditional checks via a small
ecDataShardsFromVifDir helper, eliminating the heap allocations and GC
pressure when loading many volumes.
A Canceled/DeadlineExceeded from the caller's per-request context was
treated like a dead channel: it closed the shared cached ClientConn and
cancelled every other in-flight RPC on it with "the client connection is
closing". Under a burst of concurrent chunk assigns (e.g. a large S3
multipart upload) one slow assign hitting its 10s attempt timeout could
poison the connection for all the rest, cascading into a flood of 500s.
Thread the caller's context into shouldInvalidateConnection and only
invalidate on Canceled/DeadlineExceeded while that context is still live,
which isolates the genuine stale-channel signal (a peer restart behind a
k8s Service VIP). To carry the context, add a ctx parameter to the
existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the
master assign and volume-lookup paths pass their per-attempt context and
every other caller passes context.Background().
makeupDiff replays post-snapshot changes onto the compacted volume. For a
replayed deletion it appended a tombstone to the new .dat but recorded the
.idx entry with offset 0. When that deletion is the last replayed change the
tombstone lands at the .dat tail, and the post-commit integrity check skips
offset-0 entries, so it sees 32 trailing bytes it can't account for and flips
the volume read-only, reloading it as a SortedFileNeedleMap instead of the
writable map.
Record the tombstone's real .dat offset, matching the normal delete path; the
needle map still treats it as deleted because of its negative size, so lookups are
unchanged. Mirror the same fix into the Rust volume server.
* s3: auto-enforce bucket quota read-only both ways
Quota read-only only ever flipped when an admin re-ran
s3.bucket.quota.enforce, so a bucket that went over quota stayed
read-only forever even after usage dropped back under.
Fold enforcement into the per-minute, leader-locked bucket-size loop
the s3 gateway already runs for metrics: it now flips each bucket's
read-only flag to match its quota in both directions, rewriting
filer.conf only when a flag actually changes. The set/clear decision
lives in one shared FilerConf.ApplyBucketQuotaReadOnly helper so the
shell command and the gateway can't drift.
* only manage read-only when a quota is set, never clobber manual locks
* trim comments
When the destination's stored mtime is newer than the incoming source version, UpdateEntry skips the update (last-writer-wins). A copy left truncated by an earlier failed replication trips this: the source kept the file's original mtime while the partial copy was written recently, so it looks "newer" and is never corrected. When the destination is strictly shorter than the source, re-replicate the full source content and replace the chunk list instead of skipping. Same shorter-than-source bypass for CreateEntry.
* s3api: use getEtagFromEntry for multipart part ETag to prefer Extended metadata
* s3api: add tests for getEtagFromEntry Extended ETag preference in multipart upload
* s3api: avoid double-quoting ETags in ListParts output
* s3api: add docstring for filer_multipart_etag_test.go
* fix(http): handle invalid gzip stream errors
Explain:
- problem: ReadUrlAsStream could panic when a response claimed gzip encoding but the body was not a valid gzip stream.
- root cause: the gzip reader error was ignored and a nil reader was deferred and read from.
- fix: return the gzip.NewReader error before registering Close or reading.
- validation: go test ./weed/util/http -run TestReadUrlAsStreamReturnsGzipReaderError -count=1; git diff --check.
* test: avoid closing shared global HTTP client in unit test
* filer: name the read-only path in the write rejection
The write path rejected creates under a read-only rule with a bare
"read only", giving no hint which path was locked or why. Wrap the
error with the matched location prefix and a quota hint so a FUSE
mkdir or S3 put points straight at the offending bucket.
* return the read-only reason over HTTP and drop any query string from the fallback prefix
A destination volume server that hits its idle deadline while reading a large upload body under load returns 400 "read tcp ...: i/o timeout". fetchAndWrite retried that on the flat ~1s RetryUntil backoff, hammering the already-overloaded destination. Route i/o timeout, connection reset, broken pipe and net.Error timeouts through the same escalating 10s-2min backoff already used for EOF so it can recover.
shell: only skip system-log subtree in fs.meta.save, not fsck/verify
The SystemLogDir skip lived in the shared BFS traversal, so volume.fsck
built its in-use set without the /topic/.system/log chunks and flagged
every referenced log needle as orphan. -reallyDeleteFromVolume would then
delete live log data and leave dangling filer entries. Gate the skip
behind a flag that only fs.meta.save sets.
* fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest
A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a
huge power-of-two block_size that passed validation, then force a multi-GiB
scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize
(64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and
derive the volume flag cap from the same constant so they cannot drift.
* fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy
writeToFile opened the destination with O_TRUNC before knowing whether the
source had the file, so an optional copy (ignoreSourceFileNotFound) from a source
that lacks the .ecsum truncated and then removed a valid pre-existing destination
sidecar. Stage the optional copy into a temp sibling and commit it with an atomic
rename only when the source actually delivered the file; a missing source is now
a no-op. Mandatory copies keep their in-place behavior.
* ec: add EC bitrot checksum protobuf
EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.
* ec: bitrot checksum sidecar format, validation, and per-volume load
Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.
* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild
WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.
* ec: read-only checksum scrub with Reed-Solomon arbiter
ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.
* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill
Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.
* ec: volume server bitrot config flags
-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).
* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply
Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and an
excessive block size cannot collapse a sidecar to a few oversized blocks.
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path
The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.
* fix(ec_bitrot): remove orphaned sidecars when the generation is gone
Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.
* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles
ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.
* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles
* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast
The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.
* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB
The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.
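The resulting flag validation can be sketched like this. `blockSizeBytes` and `maxBlockSizeMB` are hypothetical names; the 64 MiB cap and the power-of-two requirement are taken from the commit messages above.

```go
package main

import "fmt"

const maxBlockSizeMB = 64 // upper bound described above

// blockSizeBytes validates a block size given in MiB and converts it to
// bytes. The range check runs before the multiply, so the product cannot
// overflow int64.
func blockSizeBytes(mb int64) (int64, error) {
	if mb < 1 || mb > maxBlockSizeMB {
		return 0, fmt.Errorf("block size %d MiB out of range [1, %d]", mb, maxBlockSizeMB)
	}
	if mb&(mb-1) != 0 {
		return 0, fmt.Errorf("block size %d MiB is not a power of two", mb)
	}
	return mb << 20, nil
}

func main() {
	fmt.Println(blockSizeBytes(16))   // accepted
	fmt.Println(blockSizeBytes(1024)) // rejected: above the cap
	fmt.Println(blockSizeBytes(3))    // rejected: not a power of two
}
```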
* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx
The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull
Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).
* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32
* refactor(ec): fold the EcFiles WithContext variants into the base functions
RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.
* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles
Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.
* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif
- BackgroundECContext is now a function returning a fresh *ECContext, so callers
cannot mutate a shared singleton or race on it (and it mirrors context.Background,
which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
unreadable .vif fails closed instead of silently rebuilding with the default
ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
* shell: warn in volume.list when a volume id spans collections
A reused volume id, the result of the master handing out an id already
used by another collection (for example after losing its max-volume-id
counter on restart), makes collection.delete destroy the wrong
collection's data and makes any bare-id lookup, move, or vacuum
ambiguous. volume.list now scans the full topology and warns on ids
present in more than one collection so the clash is visible before any
destructive operation.
* volume.list: track duplicate ids lazily, sort with slices.Sort
Allocate the per-id collection set only on the first cross-collection clash
instead of one set per volume, so allocations scale with duplicates rather
than the volume count.
filemeta is the filer SQL store's default table name. A bucket of that
name passes VerifyS3BucketName but is rejected by the store's isValidBucket
guard on every operation, so it creates fine yet can't be deleted and wedges
fsck. Reject it at creation so both checks agree.
* s3: commit a versioned PutObject and its latest pointer in one transaction
A versioned PutObject wrote the version file and flipped the .versions
latest pointer in two separate routed transactions. Fold the
RECOMPUTE_LATEST into the version file's PUT so both commit atomically
under the object's per-path lock: the recompute, applied after the PUT in
the same transaction, scans the directory and sees the new version. A
crash can no longer leave the version present with a stale pointer.
putToFiler now takes a putFinalize describing the finalize step — routed
mutations folded into the PUT, or an afterCreate run under the object
write lock off the ring. Suspended-versioning keeps its afterCreate-only
form; multipart, copy, and delete-marker finalizes are unchanged.
* s3: trim verbose finalize comments
uploadFileGrpc passed SaveSmallInline with a 256 KiB limit, so uploads under
that size were written to entry.Content instead of a volume. The filer's own
upload path never inlines unless saveToFilerLimit is set (default 0), and the
S3 server shares that path. Drop the inline options so admin uploads always
land in volumes.
The viewer embedded images and PDFs through the download URL, which sent
Content-Disposition: attachment, so the browser downloaded them instead of
rendering. Add an inline mode to the download endpoint, limited to images and
PDFs so a hostile upload (HTML, SVG) can't run as same-origin script, set
X-Content-Type-Options: nosniff, and resolve the MIME the same way the viewer
does. The viewer now requests the inline URL.
* volume: don't drop the last writable slot on auto-sized disks
MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk
with room for exactly one volume (free between 1x and 2x the size limit)
reported 0 slots. The master then never grew a writable volume and every
assign drained its retry budget, so writes failed with context deadline
exceeded. Count the full volumes that actually fit, floored at one for an
auto-sized disk that has free space.
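The slot arithmetic can be illustrated as follows. This is a simplified sketch: `autoSlots` is a hypothetical function that looks only at free bytes and the size limit, whereas the real MaybeAdjustVolumeMax also accounts for other state such as volumes already on the disk.

```go
package main

import "fmt"

// autoSlots counts how many full volumes fit in freeBytes, floored at one
// when the disk has any free space at all.
func autoSlots(freeBytes, volumeSizeLimit uint64) uint64 {
	if freeBytes == 0 || volumeSizeLimit == 0 {
		return 0
	}
	n := freeBytes / volumeSizeLimit
	if n == 0 {
		return 1
	}
	return n
}

func main() {
	const gib = 1 << 30
	limit := uint64(30 * gib)
	// Free space between 1x and 2x the limit: one slot, where the old
	// "count minus one" arithmetic reported zero.
	fmt.Println(autoSlots(45*gib, limit))
	fmt.Println(autoSlots(95*gib, limit))
}
```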
* mini: show disk and volume capacity in the startup banner
Print free space, volume size, total volume count and free volume count
under the data directory line, so a volume size limit that outstrips the
disk is visible at startup instead of surfacing later as failed writes.
* terraform: add cloud-agnostic core renderer module
Renders per-node weed argv, systemd units, config files, disk-mount and secret-fetch scripts, and cloud-init from an address map. Creates zero cloud resources. Flags verified against the weed binary: volume uses -mserver for the master list, gRPC is -port.grpc (auto http+10000), minFreeSpacePercent is a string, filer store via -defaultStoreDir.
* terraform: add mTLS and JWT security module
Generates the CA, per-component certs with distinct CNs, and JWT signing keys via the tls/random providers. Emits a core_security object plus PEMs for secret-store delivery.
* terraform: add AWS deployment module and examples
Reserves stable ENIs first, renders config via the core, then creates instances, prevent_destroy EBS data disks mounted at /data, and the cluster security group. With enable_security, generates certs/JWT, stores them in SSM SecureString, grants an instance role, and fetches them at boot so secrets stay out of user_data. Keyed for_each on every stateful tier.
* terraform: add local cluster test harnesses
run_local_cluster.sh and run_local_secure.sh render a cluster with the core and run real weed processes, asserting master quorum, volume registration, filer/s3 round-trips, mutual-TLS formation, and JWT enforcement. Use an isolated high port range with a guard so they never touch a cluster already running on the machine. The weed binary defaults to $(go env GOPATH)/bin/weed.
* terraform: add CI workflow and README
fmt/validate/tofu-test plus smoke jobs that build weed and run both harnesses.
* terraform: guard against empty filesystem UUID in mount script
An empty UUID made grep -q match any fstab line, skipping the fstab entry and breaking the mount. Fail fast when blkid returns no UUID.
* terraform: sanitize cluster name in WEED_CLUSTER env keys
Hyphens or spaces in cluster_name produced invalid systemd/bash env var names; map non-alphanumerics to underscores.
* terraform: omit empty jwt.signing block from security.toml
With enable_security and no JWT key, the template emitted [jwt.signing] key="". Gate the block on a non-empty key and cover it with a test.
* terraform: mark core security input as sensitive
The security object carries JWT signing keys; keep them out of plan output and known values.
* terraform: enforce jwt_length minimum of 32
* terraform: note region/AZ coupling in HA example
* terraform: guard WORKDIR before recursive delete in test harnesses
* terraform: fix README fence language and test count
* terraform: handle embedded s3 with no filer nodes
Indexing sort(keys(var.filers))[0] errored at plan time when embedded S3 was enabled but no filers were defined; fall back to an empty config source.
* terraform: scope kms:Decrypt to a configurable key arn
Replace the hardcoded Resource="*" with a kms_key_arn variable (default "*") so production can restrict decrypt to a specific CMK.
* terraform: encrypt EBS data volumes at rest
Set encrypted = true on the volume/filer data disks and the all-in-one example disk.
* terraform: protect filer instances from API termination
Filers hold the leveldb2 metadata store, so they are stateful and get the same disable_api_termination as masters and volumes.
* terraform: stop instance before detaching in all-in-one example
* terraform: drop stale references to the removed plan doc
* terraform: correct stale mount-step comment in aws module
* terraform: mark Terraform support as experimental in README
GET/HEAD object with an explicit versionId that does not exist returned
NoSuchKey. AWS S3 returns NoSuchVersion (404) for this case; tools that
distinguish "key gone" from "this version gone" rely on that code.
Add the ErrNoSuchVersion error code and use it on the GET and HEAD
specific-version lookups. Only a genuine not-found maps to NoSuchVersion;
a transient or internal filer error now maps to InternalError (500)
instead of a misleading 404. getSpecificObjectVersion wraps its lookup
error with %w so callers can detect filer_pb.ErrNotFound.
Add a per-row Export button (files and folders) that downloads the filer
metadata in the length-prefixed FullEntry protobuf format that weed shell
fs.meta.load reads, gzipped as <name>.meta.gz like fs.meta.save. Folders are
walked recursively via the filer BFS metadata stream, excluding the system
log subtree. Streamed over gRPC so it keeps working with the filer HTTP
listener disabled.
withFilerClientFailover treated a filer's ErrNotFound like a transport
failure: it kept the result, re-queried every other filer, and finally
wrapped the answer as "all filers failed, last error: ... no entry is
found in filer store".
For workloads with many legitimate misses (e.g. GET object?versionId=X
for a version that was deleted or expired), this turned each 404 into N
filer round-trips and produced a misleading error string.
A reachable filer that answers ErrNotFound has given an authoritative
answer; failover exists to route around unreachable or unhealthy filers,
not to look harder for an entry the store reports as absent. Return
ErrNotFound directly instead of fanning out. Callers that need
read-after-write retries already handle that at the S3 semantic layer
(e.g. getLatestObjectVersion).
* perf(s3.iam.GetUser): Make the API default to the request username if not specified
This makes the Embedded S3 IAM API align with the documented behavior of the AWS IAM
API as per AWS Docs: https://docs.aws.amazon.com/IAM/latest/APIReference/API_GetUser.html
BREAKING CHANGE: This changes the default behavior of the Embedded IAM API: when no
username is specified, the GetUser request handler now uses the username of the user
who holds the access key used to make the request.
* test: cover GetUser implicit username default
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* operation: bound upload retries and honor context cancellation
retriedUploadData hardcoded 3 attempts and an uninterruptible backoff
sleep. A synchronous replica write to a dead host therefore paid the
full dial timeout three times over before failing.
Add UploadOption.MaxAttempts (<=0 keeps the default of 3) so callers can
cap attempts, and make the loop return as soon as the context is
cancelled so an abandoned upload unwinds instead of retrying.
* topology: fail replica writes fast when a replica is unreachable
DistributedOperation already returns on the first error, but a single
dead replica is itself the slow result: its goroutine retries the upload
three times through the dial timeout (~30s) before any error surfaces,
stalling the originating client write the whole time.
Make the replica write a single attempt (MaxAttempts=1) so a dead
replica fails after one dial timeout instead of three, and thread a
context into DistributedOperation that is cancelled once the outcome is
decided, so a healthy replica is no longer held hostage by another one stalled
in a dial. The originating client write is what retries.
* topology: keep replica deletes off the client request context
ReplicatedDelete runs after the local needle is already deleted. Driving
the replica deletes off r.Context() means a client disconnect cancels
them and orphans needles on the replicas, so use a background context.
* operation, topology: trim comments on the replica fail-fast path
* fix: return immediately on first error in DistributedOperation
* simplify DistributedOperation fail-fast to a single buffered channel
Drop the separate errCh: the collector now fails fast on the first error
it reads off the buffered resultCh and returns ret.Error(), so the early
return carries the same [host]: err annotation as the aggregated path and
there is no select race between two channels.
---------
Co-authored-by: Ubuntu User <ubuntu@example.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
A volume removed from writables by the write-assign path stamps fullSince
in RecordAssign, which UpdateVolumeSize's recovery branch needs to re-add
it once it decays back under the limit. A volume removed by the heartbeat
capacity path (SetVolumeCapacityFull) never stamped it, so after the
reported size dropped — vacuum, TTL expiry, bulk deletes — the volume
stayed out of writables forever, even though every heartbeat carried the
smaller size.
Stamp fullSince when the capacity path actually removes a volume from
writables, so the existing recovery branch fires. Gating on the removal
keeps it paired with the caller's activeVolumeCount decrement, matching
RecordAssign. Oversized volumes still stay out, as before.
* fix(vacuum): notify master writable after worker vacuum commit
Add Phase 3 (markWritableOne) that walks vacuumTargets and calls
VolumeMarkWritable on each replica's volume server, mirroring
batchVacuumVolumeCommit's per-replica SetVolumeAvailable. Failures are
logged at WARN; the task does not fail because the vacuum itself
already succeeded. See upstream seaweedfs#9685.
* fix(vacuum): delay Phase 3 to let post-commit heartbeats settle
Phase 3's VolumeMarkWritable can race with the volume server's first
post-commit heartbeat. SetVolumeWritable adds the vid to writables,
but a racing heartbeat whose ReadOnly value changed re-runs
EnsureCorrectWritables against the master's per-replica cache, and any
replica still cached as ReadOnly=true silently removes the vid again
— with no further heartbeat change to trigger another recovery.
Sleep 30s after Phase 2 (Commit) so every replica's post-vacuum
heartbeat has reached the master before Phase 3 fires. Cancel cleanly
on ctx.Done so a shutdown during the wait still exits.
* fix(vacuum): reduce post-commit settle from 30s to 10s
VolumePulsePeriod is 5s, so 10s (2x) is enough margin for every
replica's post-commit heartbeat to reach the master before Phase 3
fires. 30s was overly conservative and made TestVacuumExecutionIntegration
hit its 30s context deadline.
* fix(vacuum): use flat 1m timeout for VolumeMarkWritable RPC
VolumeMarkWritable on the volume server is a metadata operation
(reopen idx + flags + master ReadOnly=false heartbeat), independent
of volume size. Scaling via vacuumTimeout(time.Minute) gave it tens
of minutes — even hours on TB volumes — so a single unresponsive
replica could block Phase 3 indefinitely. Use a flat 1m cap.
* fix(vacuum): gate post-vacuum mark-writable on commit read-only state
Phase 3 force-called VolumeMarkWritable on every replica unconditionally,
clearing the read-only flag and persisting ReadOnly=false even for a
replica left read-only by an operator, an EIO quarantine, or low disk.
That overrode states the master deliberately keeps out of writables;
master built-in vacuum gates the same step on the commit's IsReadOnly via
SetVolumeAvailable.
Capture the VacuumVolumeCommit response and skip Phase 3 when any replica
came back read-only, letting it recover on its own ReadOnly=false
heartbeat. Drop the 10s post-commit settle sleep: the heartbeat race it
guarded needed a replica cached read-only at the master, which the gate
now excludes.
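The gate described above can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: `commitResult` and `shouldMarkWritable` are hypothetical names, and treating an empty result set as "do not mark writable" is an assumption.

```go
package main

// commitResult is a hypothetical stand-in for one replica's
// VacuumVolumeCommit response; only the read-only flag matters here.
type commitResult struct {
	Server     string
	IsReadOnly bool
}

// shouldMarkWritable reports whether Phase 3 should run. It returns false
// when any replica came back read-only from the commit, so an operator
// lock, EIO quarantine, or low-disk state is not overridden.
// Assumption: with no commit results at all there is nothing to mark.
func shouldMarkWritable(results []commitResult) bool {
	if len(results) == 0 {
		return false
	}
	for _, r := range results {
		if r.IsReadOnly {
			return false
		}
	}
	return true
}
```

A caller would collect one result per replica during Phase 2 and skip Phase 3 entirely when this returns false.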
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* wdclient, dailyrun: add equal jitter to retry backoff
Prevents thundering-herd retries when many clients recover from a
transient failure at the same instant (e.g., filer restart, network
partition healing).
Uses equal jitter: wait in [d/2, d) instead of deterministic d.
This bounds the maximum wait while still desynchronizing clients.
Files:
- weed/wdclient/filer_client.go (LookupVolumeIds retry loop)
- weed/s3api/s3lifecycle/dailyrun/dispatch.go (dispatchWithRetry)
Tests added for bounds, zero/negative inputs, and distribution sanity.
Closes #9735
* wdclient: honor ctx cancellation during LookupVolumeIds backoff
---------
Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Adds standard Kubernetes liveness/readiness endpoints to all HTTP
servers that were missing them:
- S3: adds /readyz (already had /healthz)
- IAM: adds /healthz and /readyz (had none)
- Volume: adds /readyz (already had /healthz)
- Filer: adds /readyz on default and readonly mux
- Master: adds /healthz and /readyz at root level
(preserves existing /cluster/healthz)
All endpoints reuse existing health handlers or return 200 OK as a
minimal foundation. Future PRs can enhance /readyz with dependency
checks without breaking the contract.
Closes #9736
Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>
On a read-only watched path the idle heartbeat keeps sync_offset fresh,
but a busy source filer still emits a MaxUnsyncedEvents marker after many
filtered events. The marker has a non-nil but empty EventNotification, so
the client routed it to the event path, where it advanced no real
watermark yet drove offsetFunc to republish the stale processed
watermark — regressing the gauge between heartbeats and spiking the
derived lag every time a filtered-event burst landed.
Route the empty marker through OnIdleHeartbeat like the idle heartbeat so
its fresh timestamp keeps the gauge current; it still advances the
in-stream resume cursor.
* fix(shell): verify volume.merge output before overwriting replicas
volume.merge overwrote every replica with the merged copy without checking it was complete. Read back the merged copy and refuse to overwrite unless it holds at least as many live needles as the most complete source replica, leaving the originals intact on a short or empty merge.
* fix(shell): keep merged volume until all replicas are rebuilt
On a copy failure partway through the overwrite loop, the temporary merged copy was deleted along with the half-rebuilt replicas. Stop deleting it until every replica has been rebuilt; on failure the verified copy is kept so the merge can be re-run to completion.
* refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly
* fix(shell): guard against nil volume status response
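The completeness check for volume.merge can be sketched as a simple comparison of live-needle counts. `safeToOverwrite` is a hypothetical name, and how the counts are read back from the volume servers is left out.

```go
package main

// safeToOverwrite reports whether the merged copy may replace the
// replicas: it must hold at least as many live needles as the most
// complete source replica. With no source counts it refuses.
func safeToOverwrite(mergedLive uint64, sourceLive []uint64) bool {
	if len(sourceLive) == 0 {
		return false
	}
	var most uint64
	for _, n := range sourceLive {
		if n > most {
			most = n
		}
	}
	return mergedLive >= most
}
```

When this returns false the originals stay untouched, and the merged copy is kept so the merge can be re-run.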
A bearer-token client whose SDK appends a CRC32 trailer sends an
unsigned-streaming PUT (STREAMING-UNSIGNED-PAYLOAD-TRAILER) with no SigV4
signature, so getRequestAuthType classifies it as authTypeStreamingUnsigned.
The auth dispatch ignored the bearer token and fell back to anonymous, and
newChunkedReader tried to verify the bearer token as a SigV4 seed signature
and failed, so the body could not be decoded either.
Dispatch the streaming-unsigned auth on whatever credential is present
(SigV4 / JWT / anonymous), and skip the SigV4 seed-signature recompute for
JWT requests in the chunked reader.
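As an illustration of dispatching on whatever credential is present, the sketch below classifies the Authorization header value. All names here are hypothetical, and the real getRequestAuthType logic considers more inputs than this single header.

```go
package main

import "strings"

// credKind is a hypothetical classification of the credential carried
// by an unsigned-streaming request.
type credKind int

const (
	credAnonymous credKind = iota
	credSigV4
	credJWT
	credUnsupported
)

// credentialKind inspects the Authorization header value only.
func credentialKind(authorization string) credKind {
	switch {
	case authorization == "":
		return credAnonymous
	case strings.HasPrefix(authorization, "Bearer "):
		return credJWT
	case strings.HasPrefix(authorization, "AWS4-HMAC-SHA256 "):
		return credSigV4
	default:
		return credUnsupported
	}
}

// needsSeedSignature reports whether the chunked reader should recompute
// a SigV4 seed signature; JWT and anonymous requests have none to verify.
func needsSeedSignature(k credKind) bool {
	return k == credSigV4
}
```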
* fix(volume): verify the .dat-tail needle in the integrity check
CheckVolumeDataIntegrity checked the last entry by file position in the .idx
and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset.
That entry is the .dat tail only when the .idx is in append order; a key-sorted
.idx (weed fix and other rebuilds listed entries by key) puts the highest-key
needle last, whose tail sits mid-file, so healthy volumes went read-only on every
load and re-running weed fix only reproduced the sorted index.
Locate the needle at the maximum offset — the one physically last in the .dat —
and verify the .dat ends exactly at it, regardless of .idx ordering. The
append-ordered common case stays O(1) (the last entry's on-disk end matches the
.dat size); only a key-sorted index pays a single linear scan. Deletion
tombstones at the tail are now verified too, instead of skipping the file-size
check.
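The tail lookup can be sketched like this. It is a simplified model: `tailCandidate` is a hypothetical type whose `Size` stands for the needle's full on-disk length (header and padding already included), and treating an empty index as "nothing to verify" is an assumption.

```go
package main

// tailCandidate is a simplified .idx entry: byte offset in the .dat and
// the needle's full on-disk length.
type tailCandidate struct {
	Offset int64
	Size   int64
}

// datEndsAtTail reports whether the .dat ends exactly at the needle that
// is physically last, regardless of .idx ordering.
func datEndsAtTail(entries []tailCandidate, datSize int64) bool {
	if len(entries) == 0 {
		return true // assumption: an empty index leaves nothing to verify here
	}
	// O(1) fast path: an append-ordered .idx has the tail as its last entry.
	last := entries[len(entries)-1]
	if last.Offset+last.Size == datSize {
		return true
	}
	// Key-sorted .idx: one linear scan for the maximum offset.
	tail := last
	for _, e := range entries {
		if e.Offset > tail.Offset {
			tail = e
		}
	}
	return tail.Offset+tail.Size == datSize
}
```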
* fix(command): weed fix rebuilds the .idx in .dat offset order
SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape
— so the rebuilt .idx put the highest-key needle last instead of the .dat-tail
needle, and dropped tombstones whose live needle was gone. Collect the live and
deleted entries, sort by .dat offset, and write them in append order so the .idx
stays a faithful log whose last entry is the real .dat tail.
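Writing the rebuilt index in append order amounts to merging the live and deleted entries and sorting by .dat offset. The sketch below uses a hypothetical `rebuiltEntry` type and is not the actual SaveToIdx code.

```go
package main

import "sort"

// rebuiltEntry is a hypothetical, simplified index record.
type rebuiltEntry struct {
	Key     uint64
	Offset  int64
	Deleted bool
}

// inAppendOrder merges live and deleted entries and sorts them by .dat
// offset, so the last element is the needle physically last in the .dat.
func inAppendOrder(live, deleted []rebuiltEntry) []rebuiltEntry {
	out := make([]rebuiltEntry, 0, len(live)+len(deleted))
	out = append(out, live...)
	out = append(out, deleted...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Offset < out[j].Offset
	})
	return out
}
```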
Modern botocore attaches a CRC32 trailer to plain PutObject, turning the
payload into STREAMING-UNSIGNED-PAYLOAD-TRAILER. An anonymous upload then
carries that header but no Authorization, so it was classified as
authTypeStreamingUnsigned and sent straight to SigV4 verification, which
rejected it as AccessDenied while explicit credentials kept working.
Fall back to the anonymous identity when an unsigned-streaming request
carries no signature, mirroring the plain anonymous path. The request
stays classified as unsigned-streaming so the chunked body is still
decoded.
* test(framework): support multiple disks per server in MultiVolumeCluster
StartMultiVolumeClusterWithDisks gives each volume server N data
directories (one DiskLocation each), passed to -dir as a comma list, with
a per-server disk-dir accessor for file inspection. StartMultiVolumeCluster
keeps its one-disk default.
* test(ec): end-to-end encode over a multi-server multi-disk stuck layout
A volume in the stuck state — real .dat source, a 0-byte stub replica, and
partial stale EC shards from an interrupted encode — must converge to one
valid EC layout. Asserts the full shard set across servers, .ecx/.vif kept
per server (info file survives the source-volume delete), stale shards
cleared, and no regular .dat/.idx left behind.
* fix(storage): keep EC .vif when deleting a coexisting regular volume
A regular volume and an EC volume for the same id share <base>.vif. When
EC shards are distributed onto a server that still holds the regular
volume — the encode source, or any replica the planner targets — the
post-encode VolumeDelete ran removeVolumeFiles and stripped the shared
.vif, leaving the freshly built EC volume without its info file.
Skip the .vif in removeVolumeFiles when an EC volume for the same id
exists on the disk (mounted, or a sealed .ecx on disk). The regular
volume's .dat/.idx still go; the EC sidecars survive.
A two-server end-to-end test encodes a volume whose source and a stub
replica both also receive shards, and asserts the final on-disk layout:
both .dat/.idx gone, each server holding only its assigned shards plus
.ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the
Rust seaweed-volume port carries the same guard and tests.
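The guard can be pictured as a filter on the list of files to delete. This sketch is deliberately simplified: `filesToRemove` is a hypothetical helper, the real removeVolumeFiles handles more file types, and detecting a coexisting EC volume (mounted, or a sealed .ecx on disk) is reduced to a boolean.

```go
package main

// filesToRemove lists the regular-volume files to delete for base
// (the <base> file name without extension). The shared .vif is kept
// when an EC volume for the same id exists on the disk.
func filesToRemove(base string, ecExists bool) []string {
	exts := []string{".dat", ".idx"}
	if !ecExists {
		exts = append(exts, ".vif")
	}
	files := make([]string, 0, len(exts))
	for _, ext := range exts {
		files = append(files, base+ext)
	}
	return files
}
```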
* test(storage): assert .idx is removed in the no-EC destroy case
Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular
volume cleanup (.dat, .idx, .vif) when no EC volume coexists.
* fix(filer): derive inodes by hash instead of a snowflake sequencer
Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.
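A rough sketch of deterministic inode derivation follows. The hash function and byte layout are assumptions: FNV-1a is used here only as a stand-in, and the filer and FUSE mount may use a different hash and encoding. `inodeFor` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"hash/fnv"
)

// inodeFor derives an inode without any sequencer state. Hard links hash
// their shared HardLinkId so every link resolves to one inode; other
// entries hash path + crtime.
func inodeFor(path string, crtime int64, hardLinkID []byte) uint64 {
	h := fnv.New64a() // stand-in hash, not necessarily what the filer uses
	if len(hardLinkID) > 0 {
		h.Write(hardLinkID)
		return h.Sum64()
	}
	h.Write([]byte(path))
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], uint64(crtime))
	h.Write(buf[:])
	return h.Sum64()
}
```

Because the result depends only on the entry's own fields, any filer computes the same inode for the same entry.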
* chore: remove the experimental NFS gateway
The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.
* refactor(filer): drop the inode->path index
With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
* add make test_keylock_s3 for local development and debugging
* fix typos
* add condition oidc:azp
* docker: reuse test/s3/iam realm and iam config for keycloak dev compose
Point the keycloak dev compose at the existing test/s3/iam configs instead
of a parallel realm/port/key/role set. Adds one declarative realm import
(seaweedfs-test-realm.json) as the single realm source and drops the
duplicated iam.json/s3.json.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>