seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-09 18:32:43 +00:00

Author	SHA1	Message	Date
Chris Lu	2a46d457ac	4.31 4.31	2026-06-01 23:32:04 -07:00
Chris Lu	e264e9883e	fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM under load (#9780 ) * fix(seaweed-volume): bound request body and stored-content expansion to prevent OOM The Rust volume server buffered the entire upload body with to_bytes(usize::MAX) and only checked the file-size limit afterward, so a single large upload — or many concurrent uploads, since the in-flight byte throttle defaults to 0 (unlimited) — could exhaust memory and get the process OOM-killed under load. The read path had two more single-request OOM vectors: `vec![0u8; manifest.size]` allocated from an attacker-controlled chunk-manifest size, and gzip decompression was unbounded (gzip bomb). - Bound the upload body read by file_size_limit_bytes (plus a margin for multipart framing), mirroring Go's io.LimitReader(sizeLimit+1), and reject oversize before the whole body is buffered. - Validate manifest.size (reject negative / oversized) before allocating. - Cap gzip output in maybe_decompress_gzip and route the inline GzDecoder sites through it. * fix(seaweed-volume): address review - chunk offset, 32-bit cast, decompress errors - Validate chunk.offset before indexing in chunk-manifest expansion: a negative offset wrapped to a huge usize and underflowed `end - offset` (panic from a crafted manifest). Reject negative, skip out-of-range, use saturating math. - Use usize::try_from for the upload body limit instead of `as usize`, so a >usize::MAX file_size_limit on 32-bit caps at usize::MAX rather than silently truncating to a tiny value. - maybe_decompress_gzip now returns Result<_, GunzipError> distinguishing a decode failure (callers fall back to raw bytes, as before) from hitting the size cap (TooLarge), which now returns 413 instead of silently serving the still-compressed bytes. 
* fix(seaweed-volume): inflate manifest chunks into the result window to cap peak memory The chunk-manifest expansion still doubled memory: `result` was already allocated at manifest.size (<=2 GiB) and each compressed chunk was inflated into a separate Vec (also up to 2 GiB), so a single request could peak near 4 GiB. Decompress compressed chunks directly into their result[offset..] window (bounded by the remaining space) so a chunk never allocates a second large buffer; peak stays at ~manifest.size. Bytes past the window are dropped (matching the prior truncation), and a fully-undecodable chunk still falls back to its raw bytes. * fix(seaweed-volume): fall back to raw chunk bytes on any decode failure Per review: the gzip fallback must run on any decode error, not only when no bytes were decoded. Clear the partially-written output and copy the chunk's raw bytes (truncated to the window), restoring the prior decode-failure behavior.	2026-06-01 22:24:13 -07:00
Chris Lu	fba71ab14c	ci: parallelize the unified release-container build (#9783 ) * docker: cross-compile the Go binary instead of emulating it under QEMU The builder stage ran as the target platform, so arm64/arm/386 images emulated the whole Go compile (and the full git clone) under QEMU. The binary is CGO-free, so pin the builder to $BUILDPLATFORM and cross-compile with GOOS/GOARCH (GOARM for v7), keeping every target's compile native. * ci: build all release container variants in parallel The build matrix throttled to two variants at a time on a stale rate-limit worry. Pulls go through mirror.gcr.io and pushes target GHCR only, so the five variants can all build at once. * ci: copy each variant to Docker Hub from its build job The separate copy-to-dockerhub job waited on the whole build matrix before any GHCR -> Docker Hub copy could start. Move the crane copy into the build job so each variant copies as soon as it is built, overlapping with the others still compiling. tag-latest and helm-release now depend on build.	2026-06-01 20:34:05 -07:00
Neetika Mittal	45465e5a05	fix(master): notify clients after manual volume grow (#9656) Co-authored-by: Neetika Mittal <mneetika@users.noreply.github.com>	2026-06-01 20:33:37 -07:00
Chris Lu	bf37fba0e1	fix(s3): recover versioned reads when the .versions latest pointer is absent (#9782 ) GetObject on a versioned object returned NoSuchKey forever when the .versions directory existed but carried no latest-version pointer (empty Extended metadata) while real version files remained inside it. The self-heal path only fired for a dangling pointer (present but referencing a missing file), not an absent one, so doGetLatestObjectVersion fell straight through and errored on every read. - doGetLatestObjectVersion now calls recoverLatestVersionWithoutPointer when the pointer is missing or empty. An absent pointer is the legitimate signal that a pre-versioning or suspended-versioning "null" object is current, so that object wins; only when it is absent do we rescan .versions/ and rebuild the pointer from the version files present. Transient rescan failures propagate instead of being masked as NotFound. - selectLatestVersion derives the version id from the v_<versionId> file name when the Seaweed-X-Amz-Version-Id attribute is absent, so version files written outside the normal versioned-PUT path (replicated or restored entries) are still promotable. The orphan diagnostic uses the same detection so an entry can't be both promoted and counted an orphan.	2026-06-01 20:01:30 -07:00
Chris Lu	ca81c0c525	fix(ec): pass per-volume data-shard count to the parity-shard split (#9781 ) * fix(ec): pass per-volume data-shard count to the parity-shard split ShardsInfo.DeleteParityShards/MinusParityShards looped ids 10..13, assuming the fixed 10+4 layout. For a non-default ratio this splits data vs parity wrong — a wide ratio (12+4, 16+6) drops real data ids >= 10, which breaks ec.decode. They now take a dataShards argument (<= 0 falls back to DataShardsCount) and clear ids dataShards..MaxShardCount. ec.decode threads the data-shard count from collectEcNodeShardsInfo to both split call sites, and admin LogicalSize passes DataShardsCount. Also: EC cleanup now sets an explicit per-disk storage impact (-len(ShardIds)) instead of falling back to the TotalShardsCount constant, so freed-capacity accounting matches the shards actually removed. OSS is always 10+4, so behavior is unchanged here; this keeps the split ratio-correct and the API aligned with the enterprise per-volume override. Adds parity-split ratio tests. * ec: clear parity shards in one locked pass Address review: DeleteParityShards looped si.Delete, taking the lock once per id. shards is sorted by Id and shardBits is a bitmap, so mask off the high bits and truncate the sorted slice at the first parity id (binary search) under a single lock. Preserves the dataShards<=0 -> DataShardsCount default.	2026-06-01 19:25:15 -07:00
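The single-pass parity split described above (mask off the high bits of the bitmap, then truncate the sorted id slice at the first parity id) might look roughly like this. `dropParity`, `defaultDataShards`, and `maxShardCount` are hypothetical names for this sketch, and the 32-bit bitmap width is an assumption; locking is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	defaultDataShards = 10 // the standard 10+4 layout named in the commit
	maxShardCount     = 32 // assumed bitmap width for this sketch
)

// dropParity clears every shard id >= dataShards from both the bitmap and
// the sorted id slice. dataShards <= 0 falls back to the default.
func dropParity(bits uint32, ids []int, dataShards int) (uint32, []int) {
	if dataShards <= 0 {
		dataShards = defaultDataShards
	}
	if dataShards >= maxShardCount {
		return bits, ids // nothing above the data range to clear
	}
	mask := uint32(1)<<uint(dataShards) - 1 // low dataShards bits set
	cut := sort.SearchInts(ids, dataShards) // index of first id >= dataShards
	return bits & mask, ids[:cut]
}

func main() {
	ids := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
	bits, kept := dropParity(0xFFFF, ids, 12) // a 12+4 volume
	fmt.Printf("%#x %d\n", bits, len(kept))
}
```

With a 12+4 layout the data ids 10 and 11 survive, which is the case the fixed 10..13 loop got wrong.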
Chris Lu	f410d975c7	fix(ec): resolve EC data-shard count from the volume's .vif on reboot (#9779 ) * fix(ec): resolve EC data-shard count from the volume's .vif on reboot A volume server never loads a cluster EC config into memory, so startup decisions that assumed 10 data shards mishandled volumes whose .vif records a different ratio: - validateEcVolume sized the expected shard against 10 data shards and required >=10 local shards, so a volume with a non-default ratio and a coexisting .dat could be wiped on reboot. Read the ratio from the .vif. - pruneIncompleteEcWithSiblingDat used the hardcoded 10-shard threshold, so a full data set for a non-default ratio with a healthy sibling .dat was wiped as a partial leftover. Use the EcVolume's .vif-derived ratio. Behavior is unchanged for the standard 10+4 layout (the .vif resolves to 10). Adds storage-level reboot tests. * ec: avoid per-call allocations in ecDataShardsFromVif Address review: the helper runs once per EC volume at startup. Replace the slice+map dedup of the two dirs with direct conditional checks via a small ecDataShardsFromVifDir helper, eliminating the heap allocations and GC pressure when loading many volumes.	2026-06-01 19:22:14 -07:00
steve.wei	1313600b9e	fix(topology): restore active count after vacuum recovery (#9770)	2026-06-01 15:23:22 -07:00
Chris Lu	2386fa550a	grpc: don't tear down the shared master connection on a caller's own timeout (#9775 ) A Canceled/DeadlineExceeded from the caller's per-request context was treated like a dead channel: it closed the shared cached ClientConn and cancelled every other in-flight RPC on it with "the client connection is closing". Under a burst of concurrent chunk assigns (e.g. a large S3 multipart upload) one slow assign hitting its 10s attempt timeout could poison the connection for all the rest, cascading into a flood of 500s. Thread the caller's context into shouldInvalidateConnection and only invalidate on Canceled/DeadlineExceeded while that context is still live, which isolates the genuine stale-channel signal (a peer restart behind a k8s Service VIP). To carry the context, add a ctx parameter to the existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the master assign and volume-lookup paths pass their per-attempt context and every other caller passes context.Background().	2026-06-01 15:11:02 -07:00
Chris Lu	dfa86b4313	volume: keep volume writable after a deletion-tail compaction (#9776 ) makeupDiff replays post-snapshot changes onto the compacted volume. For a replayed deletion it appended a tombstone to the new .dat but recorded the .idx entry with offset 0. When that deletion is the last replayed change the tombstone lands at the .dat tail, and the post-commit integrity check skips offset-0 entries, so it sees 32 trailing bytes it can't account for and flips the volume read-only, reloading it as a SortedFileNeedleMap instead of the writable map. Record the tombstone's real .dat offset, matching the normal delete path; the needle map still treats it as deleted off the negative size, so lookups are unchanged. Mirror the same fix into the Rust volume server.	2026-06-01 13:15:08 -07:00
Chris Lu	8c60408bfb	s3: auto-enforce bucket quota read-only both ways (#9774 ) * s3: auto-enforce bucket quota read-only both ways Quota read-only only ever flipped when an admin re-ran s3.bucket.quota.enforce, so a bucket that went over quota stayed read-only forever even after usage dropped back under. Fold enforcement into the per-minute, leader-locked bucket-size loop the s3 gateway already runs for metrics: it now flips each bucket's read-only flag to match its quota in both directions, rewriting filer.conf only when a flag actually changes. The set/clear decision lives in one shared FilerConf.ApplyBucketQuotaReadOnly helper so the shell command and the gateway can't drift. * only manage read-only when a quota is set, never clobber manual locks * trim comments	2026-06-01 13:11:18 -07:00
Chris Lu	57797c9b38	filer.sync: repair a destination shorter than the source (#9778 ) When the destination's stored mtime is newer than the incoming source version, UpdateEntry skips the update (last-writer-wins). A copy left truncated by an earlier failed replication trips this: the source kept the file's original mtime while the partial copy was written recently, so it looks "newer" and is never corrected. When the destination is strictly shorter than the source, re-replicate the full source content and replace the chunk list instead of skipping. Same shorter-than-source bypass for CreateEntry.	2026-06-01 13:04:23 -07:00
Nguyễn Lộc Phúc	ed31271e28	fix(s3api): Fix multipart upload ETag compatibility with Hadoop S3A (#9772) * s3api: use getEtagFromEntry for multipart part ETag to prefer Extended metadata * s3api: add tests for getEtagFromEntry Extended ETag preference in multipart upload * s3api: avoid double-quoting ETags in ListParts output * s3api: add docstring for filer_multipart_etag_test.go	2026-06-01 13:03:46 -07:00
7y-9	5ea75dcc67	fix(http): handle invalid gzip stream errors (#9767) * fix(http): handle invalid gzip stream errors Explain: - problem: ReadUrlAsStream could panic when a response claimed gzip encoding but the body was not a valid gzip stream. - root cause: the gzip reader error was ignored and a nil reader was deferred and read from. - fix: return the gzip.NewReader error before registering Close or reading. - validation: go test ./weed/util/http -run TestReadUrlAsStreamReturnsGzipReaderError -count=1; git diff --check. * test: avoid closing shared global HTTP client in unit test	2026-06-01 12:21:19 -07:00
Chris Lu	1a19683ee6	filer: name the read-only path in the write rejection (#9773 ) * filer: name the read-only path in the write rejection The write path rejected creates under a read-only rule with a bare "read only", giving no hint which path was locked or why. Wrap the error with the matched location prefix and a quota hint so a FUSE mkdir or S3 put points straight at the offending bucket. * return the read-only reason over HTTP and drop any query string from the fallback prefix	2026-06-01 12:20:45 -07:00
Chris Lu	2e3fabbf24	filer.sync: back off on transient upload errors (#9777 ) A destination volume server that hits its idle deadline while reading a large upload body under load returns 400 "read tcp ...: i/o timeout". fetchAndWrite retried that on the flat ~1s RetryUntil backoff, hammering the already-overloaded destination. Route i/o timeout, connection reset, broken pipe and net.Error timeouts through the same escalating 10s-2min backoff already used for EOF so it can recover.	2026-06-01 12:18:17 -07:00
Chris Lu	f9ee49b03e	shell: volume.fsck must not skip the system-log subtree (#9764) shell: only skip system-log subtree in fs.meta.save, not fsck/verify The SystemLogDir skip lived in the shared BFS traversal, so volume.fsck built its in-use set without the /topic/.system/log chunks and flagged every referenced log needle as orphan. -reallyDeleteFromVolume would then delete live log data and leave dangling filer entries. Gate the skip behind a flag that only fs.meta.save sets.	2026-06-01 09:54:22 -07:00
Chris Lu	80dd3b2621	EC bitrot follow-ups: protect destination sidecar on optional copy; cap sidecar block_size (#9763 ) * fix(ec_bitrot): cap sidecar block_size in ValidateBitrotManifest A sidecar loaded from disk (or supplied via a backfill/peer RPC) could carry a huge power-of-two block_size that passed validation, then force a multi-GiB scratch-buffer allocation in scrub/verify. Add a shared MaxBitrotBlockSize (64 MiB) constant, enforce it as an upper bound in isPow2MultipleOf1MiB, and derive the volume flag cap from the same constant so they cannot drift. * fix(ec_bitrot): don't destroy a valid destination sidecar on an optional copy writeToFile opened the destination with O_TRUNC before knowing whether the source had the file, so an optional copy (ignoreSourceFileNotFound) from a source that lacks the .ecsum truncated and then removed a valid pre-existing destination sidecar. Stage the optional copy into a temp sibling and commit it with an atomic rename only when the source actually delivered the file; a missing source is now a no-op. Mandatory copies keep their in-place behavior.	2026-05-31 23:42:33 -07:00
Chris Lu	9658f309d2	EC bitrot detection: per-shard checksum sidecars (#9761 ) * ec: add EC bitrot checksum protobuf EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages, copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode. * ec: bitrot checksum sidecar format, validation, and per-volume load Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a self-integrity header; validation, rolling builder, backfill primitive, and EcVolume load on mount + removal on destroy. * ec: capture per-shard checksums at encode; verify-and-exclude on rebuild WriteEcFilesWithContext returns the protection computed inline during encoding. generateMissingEcFiles verifies present inputs against the sidecar, excludes corrupt ones, regenerates in place, and re-verifies; fail-closed unless unsafe_ignore_sidecar, removing all generated outputs on failure. * ec: read-only checksum scrub with Reed-Solomon arbiter ChecksumScrub verifies each local shard against the sidecar and reconstructs flagged shards from the clean shards so stale-sidecar false positives are not reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum. * ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant); remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and opportunistically backfills a sidecar when all shards are reachable. * ec: volume server bitrot config flags -ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16). * fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge flag value cannot overflow int64 and slip past the power-of-two check, and a block size cannot collapse a sidecar to a few oversized blocks. 
* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path The worker EC encode wrote the generation-0 sidecar locally but never added it to shardFiles, so DistributeEcShards never shipped it and the distributed holders came up unprotected. Append it to shardFiles and map the ecsum shard type to its extension in the sender so it travels with the shards. * fix(ec_bitrot): remove orphaned sidecars when the generation is gone Gate sidecar removal on existingShardCount==0 alone rather than also requiring a stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays gated on hasEcxFile as before. * fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles ChecksumScrub's first return is blocks scanned, not files. Discard it so the scrub response's TotalFiles (a needle/file count) is not inflated by the block count for CHECKSUM mode. * test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles * fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast The header stores payload_len as a uint32; bound the payload before the conversion so a pathological manifest cannot truncate the length field and corrupt the sidecar. A real manifest is a few KB, so this never trips. * fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB The block size becomes the per-shard scratch buffer the scrub/backfill path allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent scrub worker. Lower the upper bound from 1024 to 64 MiB. * fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext, which fails closed on a malformed/stale .ecsum. Without an override flag an operator could not complete the rebuild without manually deleting the sidecar. Expose -ecUnsafeIgnoreSidecar (default false) and thread it through. 
* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull Guard len(payload) against a plain int constant (1 GiB) before the allocation instead of a uint64 MaxUint32 compare, so the allocation-size value is provably bounded (clears the CodeQL overflow alert) and the math import is no longer needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and remove the now-redundant readFull helper (os.File.ReadAt fills the slice or errors). * test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32 * refactor(ec): fold the EcFiles WithContext variants into the base functions RebuildEcFiles now takes the ECContext directly (nil => derive from .vif as before) and WriteEcFiles takes it too (nil => default), removing the parallel RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an explicit context drop the WithContext suffix; the default-context callers pass nil. No behavior change. refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles Add a non-nil BackgroundECContext placeholder (analogous to context.Background()) and have callers with no specific layout pass it instead of a nil ECContext. WriteEcFiles resolves a zero/background context to the default ratio and RebuildEcFiles resolves it from the .vif, so behavior is unchanged. fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif - BackgroundECContext is now a function returning a fresh *ECContext, so callers cannot mutate a shared singleton or race on it (and it mirrors context.Background, which is also a function). - RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but- unreadable .vif fails closed instead of silently rebuilding with the default ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.	2026-05-31 18:52:44 -07:00
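The flag validation described in this series (bound the MiB value before the int64 multiply, cap it at 64 MiB, require a power of two) can be sketched as below. `blockSizeBytes` and `maxBlockSizeMB` are hypothetical names for this sketch; the real code uses the shared MaxBitrotBlockSize constant and isPow2MultipleOf1MiB.

```go
package main

import "fmt"

const maxBlockSizeMB = 64 // upper bound named in the commit messages

// blockSizeBytes validates a block size given in MiB and converts it to
// bytes. The range check runs before the shift, so a huge flag value
// cannot overflow int64.
func blockSizeBytes(mb int64) (int64, error) {
	if mb < 1 || mb > maxBlockSizeMB {
		return 0, fmt.Errorf("block size %d MiB out of range [1, %d]", mb, maxBlockSizeMB)
	}
	if mb&(mb-1) != 0 {
		return 0, fmt.Errorf("block size %d MiB is not a power of two", mb)
	}
	return mb << 20, nil
}

func main() {
	n, err := blockSizeBytes(16) // the default from the commit message
	fmt.Println(n, err)

	_, err = blockSizeBytes(1024)
	fmt.Println(err)
}
```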
Chris Lu	fdfeb4063c	shell: warn in volume.list when a volume id spans collections (#9759 ) * shell: warn in volume.list when a volume id spans collections A reused volume id, the result of the master handing out an id already used by another collection (for example after losing its max-volume-id counter on restart), makes collection.delete destroy the wrong collection's data and makes any bare-id lookup, move, or vacuum ambiguous. volume.list now scans the full topology and warns on ids present in more than one collection so the clash is visible before any destructive operation. * volume.list: track duplicate ids lazily, sort with slices.Sort Allocate the per-id collection set only on the first cross-collection clash instead of one set per volume, so allocations scale with duplicates rather than the volume count.	2026-05-31 11:52:39 -07:00
Chris Lu	35ab67fa8a	s3: reject reserved bucket name "filemeta" (#9760) filemeta is the filer SQL store's default table name. A bucket of that name passes VerifyS3BucketName but is rejected by the store's isValidBucket guard on every operation, so it creates fine yet can't be deleted and wedges fsck. Reject it at creation so both checks agree.	2026-05-31 11:15:05 -07:00
Chris Lu	6b06fe5ec4	s3: commit a versioned PutObject and its latest pointer in one transaction (#9756 ) * s3: commit a versioned PutObject and its latest pointer in one transaction A versioned PutObject wrote the version file and flipped the .versions latest pointer in two separate routed transactions. Fold the RECOMPUTE_LATEST into the version file's PUT so both commit atomically under the object's per-path lock: the recompute, applied after the PUT in the same transaction, scans the directory and sees the new version. A crash can no longer leave the version present with a stale pointer. putToFiler now takes a putFinalize describing the finalize step — routed mutations folded into the PUT, or an afterCreate run under the object write lock off the ring. Suspended-versioning keeps its afterCreate-only form; multipart, copy, and delete-marker finalizes are unchanged. * s3: trim verbose finalize comments	2026-05-31 00:13:36 -07:00
Chris Lu	d806778757	admin: store file browser uploads in volumes, not inline (#9752) uploadFileGrpc passed SaveSmallInline with a 256 KiB limit, so uploads under that size were written to entry.Content instead of a volume. The filer's own upload path never inlines unless saveToFilerLimit is set (default 0), and the S3 server shares that path. Drop the inline options so admin uploads always land in volumes.	2026-05-30 23:47:42 -07:00
Chris Lu	186747e7e8	admin: view images and PDFs inline in the file browser (#9751 ) The viewer embedded images and PDFs through the download URL, which sent Content-Disposition: attachment, so the browser downloaded them instead of rendering. Add an inline mode to the download endpoint, limited to images and PDFs so a hostile upload (HTML, SVG) can't run as same-origin script, set X-Content-Type-Options: nosniff, and resolve the MIME the same way the viewer does. The viewer now requests the inline URL.	2026-05-30 23:46:09 -07:00
Chris Lu	05c6500453	volume: fix maxVolumeCount dead zone that stalled writes on auto-sized disks (#9755 ) * volume: don't drop the last writable slot on auto-sized disks MaybeAdjustVolumeMax subtracted 1 from the per-disk slot count, so a disk with room for exactly one volume (free between 1x and 2x the size limit) reported 0 slots. The master then never grew a writable volume and every assign drained its retry budget, so writes failed with context deadline exceeded. Count the full volumes that actually fit, floored at one for an auto-sized disk that has free space. * mini: show disk and volume capacity in the startup banner Print free space, volume size, total volume count and free volume count under the data directory line, so a volume size limit that outstrips the disk is visible at startup instead of surfacing later as failed writes.	2026-05-30 23:45:17 -07:00
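The dead zone is easiest to see with numbers. The sketch below contrasts the old "subtract one" count with the fixed count (full volumes that fit, floored at one when there is free space). `oldSlots` and `autoSlots` are hypothetical names and simplify whatever MaybeAdjustVolumeMax actually computes.

```go
package main

import "fmt"

// oldSlots models the previous behavior: volumes that fit, minus one.
func oldSlots(freeBytes, sizeLimit uint64) uint64 {
	if sizeLimit == 0 {
		return 0
	}
	n := freeBytes / sizeLimit
	if n == 0 {
		return 0
	}
	return n - 1
}

// autoSlots models the fix: count the full volumes that fit, floored at
// one for a disk that has any free space.
func autoSlots(freeBytes, sizeLimit uint64) uint64 {
	if sizeLimit == 0 {
		return 0
	}
	n := freeBytes / sizeLimit
	if n == 0 && freeBytes > 0 {
		return 1
	}
	return n
}

func main() {
	const gib = uint64(1) << 30
	free, limit := 45*gib, 30*gib // free space between 1x and 2x the limit
	fmt.Println(oldSlots(free, limit), autoSlots(free, limit))
}
```

With 45 GiB free and a 30 GiB limit, the old count reports zero slots even though one volume fits, which is the case that left the master with nothing writable to grow.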
Chris Lu	a10607f90a	Add Terraform support for VM-based SeaweedFS deployment (#9754 ) * terraform: add cloud-agnostic core renderer module Renders per-node weed argv, systemd units, config files, disk-mount and secret-fetch scripts, and cloud-init from an address map. Creates zero cloud resources. Flags verified against the weed binary: volume uses -mserver for the master list, gRPC is -port.grpc (auto http+10000), minFreeSpacePercent is a string, filer store via -defaultStoreDir. * terraform: add mTLS and JWT security module Generates the CA, per-component certs with distinct CNs, and JWT signing keys via the tls/random providers. Emits a core_security object plus PEMs for secret-store delivery. * terraform: add AWS deployment module and examples Reserves stable ENIs first, renders config via the core, then creates instances, prevent_destroy EBS data disks mounted at /data, and the cluster security group. With enable_security, generates certs/JWT, stores them in SSM SecureString, grants an instance role, and fetches them at boot so secrets stay out of user_data. Keyed for_each on every stateful tier. * terraform: add local cluster test harnesses run_local_cluster.sh and run_local_secure.sh render a cluster with the core and run real weed processes, asserting master quorum, volume registration, filer/s3 round-trips, mutual-TLS formation, and JWT enforcement. Use an isolated high port range with a guard so they never touch a cluster already running on the machine. The weed binary defaults to $(go env GOPATH)/bin/weed. * terraform: add CI workflow and README fmt/validate/tofu-test plus smoke jobs that build weed and run both harnesses. * terraform: guard against empty filesystem UUID in mount script An empty UUID made grep -q match any fstab line, skipping the fstab entry and breaking the mount. Fail fast when blkid returns no UUID. 
* terraform: sanitize cluster name in WEED_CLUSTER env keys Hyphens or spaces in cluster_name produced invalid systemd/bash env var names; map non-alphanumerics to underscores. * terraform: omit empty jwt.signing block from security.toml With enable_security and no JWT key, the template emitted [jwt.signing] key="". Gate the block on a non-empty key and cover it with a test. * terraform: mark core security input as sensitive The security object carries JWT signing keys; keep them out of plan output and known values. * terraform: enforce jwt_length minimum of 32 * terraform: note region/AZ coupling in HA example * terraform: guard WORKDIR before recursive delete in test harnesses * terraform: fix README fence language and test count * terraform: handle embedded s3 with no filer nodes Indexing sort(keys(var.filers))[0] errored at plan time when embedded S3 was enabled but no filers were defined; fall back to an empty config source. * terraform: scope kms:Decrypt to a configurable key arn Replace the hardcoded Resource="" with a kms_key_arn variable (default "") so production can restrict decrypt to a specific CMK. * terraform: encrypt EBS data volumes at rest Set encrypted = true on the volume/filer data disks and the all-in-one example disk. * terraform: protect filer instances from API termination Filers hold the leveldb2 metadata store, so they are stateful and get the same disable_api_termination as masters and volumes. * terraform: stop instance before detaching in all-in-one example * terraform: drop stale references to the removed plan doc * terraform: correct stale mount-step comment in aws module * terraform: mark Terraform support as experimental in README	2026-05-30 23:43:17 -07:00
Chris Lu	0e35235908	s3: return NoSuchVersion (not NoSuchKey) for a missing versionId (#9749 ) GET/HEAD object with an explicit versionId that does not exist returned NoSuchKey. AWS S3 returns NoSuchVersion (404) for this case; tools that distinguish "key gone" from "this version gone" rely on that code. Add the ErrNoSuchVersion error code and use it on the GET and HEAD specific-version lookups. Only a genuine not-found maps to NoSuchVersion; a transient or internal filer error now maps to InternalError (500) instead of a misleading 404. getSpecificObjectVersion wraps its lookup error with %w so callers can detect filer_pb.ErrNotFound.	2026-05-30 21:09:53 -07:00
Chris Lu	7c5ca01027	admin: export file/folder metadata from the file browser (#9750 ) Add a per-row Export button (files and folders) that downloads the filer metadata in the length-prefixed FullEntry protobuf format that weed shell fs.meta.load reads, gzipped as <name>.meta.gz like fs.meta.save. Folders are walked recursively via the filer BFS metadata stream, excluding the system log subtree. Streamed over gRPC so it keeps working with the filer HTTP listener disabled.	2026-05-30 20:59:01 -07:00
Chris Lu	3441a2a7f1	s3: short-circuit filer failover on ErrNotFound (#9748 ) withFilerClientFailover treated a filer's ErrNotFound like a transport failure: it kept the result, re-queried every other filer, and finally wrapped the answer as "all filers failed, last error: ... no entry is found in filer store". For workloads with many legitimate misses (e.g. GET object?versionId=X for a version that was deleted or expired), this turned each 404 into N filer round-trips and produced a misleading error string. A reachable filer that answers ErrNotFound has given an authoritative answer; failover exists to route around unreachable or unhealthy filers, not to look harder for an entry the store reports as absent. Return ErrNotFound directly instead of fanning out. Callers that need read-after-write retries already handle that at the S3 semantic layer (e.g. getLatestObjectVersion).	2026-05-30 15:07:27 -07:00
Chris Lu	34be9170f0	4.30 4.30	2026-05-30 10:52:32 -07:00
Elias Paitz	30f49013e1	perf(s3.iam.GetUser): Make the API default to the request username if not specified (#9746 ) * perf(s3.iam.GetUser): Make the API default to the request username if not specified This makes the Embedded S3 IAM API align with the documented behavior of the AWS IAM API as per AWS Docs: https://docs.aws.amazon.com/IAM/latest/APIReference/API_GetUser.html BREAKING CHANGE: This changes the default behavior of the Embedded IAM API to use the username of the user holding the accesskey used to make the request in the GetUsername request handler. * test: cover GetUser implicit username default --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-30 10:51:03 -07:00
Chris Lu	4bf27278fa	topology: fail replica writes fast when a replica is unreachable (#9744 ) * operation: bound upload retries and honor context cancellation retriedUploadData hardcoded 3 attempts and an uninterruptible backoff sleep. A synchronous replica write to a dead host therefore paid the full dial timeout three times over before failing. Add UploadOption.MaxAttempts (<=0 keeps the default of 3) so callers can cap attempts, and make the loop return as soon as the context is cancelled so an abandoned upload unwinds instead of retrying. * topology: fail replica writes fast when a replica is unreachable DistributedOperation already returns on the first error, but a single dead replica is itself the slow result: its goroutine retries the upload three times through the dial timeout (~30s) before any error surfaces, stalling the originating client write the whole time. Make the replica write a single attempt (MaxAttempts=1) so a dead replica fails after one dial timeout instead of three, and thread a context into DistributedOperation that is cancelled once the outcome is decided, so a healthy replica is no longer held hostage by one stalled in a dial. The originating client write is what retries. * topology: keep replica deletes off the client request context ReplicatedDelete runs after the local needle is already deleted. Driving the replica deletes off r.Context() means a client disconnect cancels them and orphans needles on the replicas, so use a background context. * operation, topology: trim comments on the replica fail-fast path	2026-05-30 10:45:02 -07:00
Chris Lu	5834c834e3	Refine enterprise edition feature blurb in version output and docs	2026-05-30 09:29:06 -07:00
Rushikesh Deshpande	ea33b851e6	fix: return immediately on first error in DistributedOperation (#9740 ) * fix: return immediately on first error in DistributedOperation * simplify DistributedOperation fail-fast to a single buffered channel Drop the separate errCh: the collector now fails fast on the first error it reads off the buffered resultCh and returns ret.Error(), so the early return carries the same [host]: err annotation as the aggregated path and there is no select race between two channels. --------- Co-authored-by: Ubuntu User <ubuntu@example.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-30 00:14:44 -07:00
Chris Lu	e60d02c339	fix(topology): recover heartbeat-fulled volumes once they shrink (#9742 ) A volume removed from writables by the write-assign path stamps fullSince in RecordAssign, which UpdateVolumeSize's recovery branch needs to re-add it once it decays back under the limit. A volume removed by the heartbeat capacity path (SetVolumeCapacityFull) never stamped it, so after the reported size dropped — vacuum, TTL expiry, bulk deletes — the volume stayed out of writables forever, even though every heartbeat carried the smaller size. Stamp fullSince when the capacity path actually removes a volume from writables, so the existing recovery branch fires. Gating on the removal keeps it paired with the caller's activeVolumeCount decrement, matching RecordAssign. Oversized volumes still stay out, as before.	2026-05-30 00:06:10 -07:00
Jaehoon Kim	4b23204023	fix(vacuum): writable volume re-notification after worker VACUUM (#9732 ) * fix(vacuum): notify master writable after worker vacuum commit Add Phase 3 (markWritableOne) that walks vacuumTargets and calls VolumeMarkWritable on each replica's volume server, mirroring batchVacuumVolumeCommit's per-replica SetVolumeAvailable. Failures are logged at WARN; the task does not fail because the vacuum itself already succeeded. See upstream seaweedfs#9685. * fix(vacuum): delay Phase 3 to let post-commit heartbeats settle Phase 3's VolumeMarkWritable can race with the volume server's first post-commit heartbeat. SetVolumeWritable adds the vid to writables, but a racing heartbeat whose ReadOnly value changed re-runs EnsureCorrectWritables against the master's per-replica cache, and any replica still cached as ReadOnly=true silently removes the vid again — with no further heartbeat change to trigger another recovery. Sleep 30s after Phase 2 (Commit) so every replica's post-vacuum heartbeat has reached the master before Phase 3 fires. Cancel cleanly on ctx.Done so a shutdown during the wait still exits. * fix(vacuum): reduce post-commit settle from 30s to 10s VolumePulsePeriod is 5s, so 10s (2x) is enough margin for every replica's post-commit heartbeat to reach the master before Phase 3 fires. 30s was overly conservative and made TestVacuumExecutionIntegration hit its 30s context deadline. * fix(vacuum): use flat 1m timeout for VolumeMarkWritable RPC VolumeMarkWritable on the volume server is a metadata operation (reopen idx + flags + master ReadOnly=false heartbeat), independent of volume size. Scaling via vacuumTimeout(time.Minute) gave it tens of minutes — even hours on TB volumes — so a single unresponsive replica could block Phase 3 indefinitely. Use a flat 1m cap. 
* fix(vacuum): gate post-vacuum mark-writable on commit read-only state Phase 3 force-called VolumeMarkWritable on every replica unconditionally, clearing the read-only flag and persisting ReadOnly=false even for a replica left read-only by an operator, an EIO quarantine, or low disk. That overrode states the master deliberately keeps out of writables; master built-in vacuum gates the same step on the commit's IsReadOnly via SetVolumeAvailable. Capture the VacuumVolumeCommit response and skip Phase 3 when any replica came back read-only, letting it recover on its own ReadOnly=false heartbeat. Drop the 10s post-commit settle sleep: the heartbeat race it guarded needed a replica cached read-only at the master, which the gate now excludes. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-29 23:43:24 -07:00
Mohamed Chorfa	e5fb547e95	wdclient, dailyrun: add equal jitter to retry backoff (#9737 ) * wdclient, dailyrun: add equal jitter to retry backoff Prevents thundering-herd retries when many clients recover from a transient failure at the same instant (e.g., filer restart, network partition healing). Uses equal jitter: wait in [d/2, d) instead of deterministic d. This bounds the maximum wait while still desynchronizing clients. Files: - weed/wdclient/filer_client.go (LookupVolumeIds retry loop) - weed/s3api/s3lifecycle/dailyrun/dispatch.go (dispatchWithRetry) Tests added for bounds, zero/negative inputs, and distribution sanity. Closes #9735 * wdclient: honor ctx cancellation during LookupVolumeIds backoff --------- Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-29 20:54:54 -07:00
Mohamed Chorfa	10c4ab3e33	s3, iam, volume, filer, master: add /healthz and /readyz health probes (#9738 ) Adds standard Kubernetes liveness/readiness endpoints to all HTTP servers that were missing them: - S3: adds /readyz (already had /healthz) - IAM: adds /healthz and /readyz (had none) - Volume: adds /readyz (already had /healthz) - Filer: adds /readyz on default and readonly mux - Master: adds /healthz and /readyz at root level (preserves existing /cluster/healthz) All endpoints reuse existing health handlers or return 200 OK as a minimal foundation. Future PRs can enhance /readyz with dependency checks without breaking the contract. Closes #9736 Co-authored-by: Mohamed Chorfa <mohamed.chorfa@thalesgroup.com>	2026-05-29 20:45:03 -07:00
Chris Lu	4c5d1d53b4	Update README.md	2026-05-29 14:34:24 -07:00
Chris Lu	ba9e74d8a7	docs: add zyner as a gold sponsor	2026-05-29 12:40:58 -07:00
7y-9	fbcba51e73	refactor: avoid unused sql insert result (#9734)	2026-05-29 00:45:45 -07:00
Chris Lu	c9623007a2	fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733 ) On a read-only watched path the idle heartbeat keeps sync_offset fresh, but a busy source filer still emits a MaxUnsyncedEvents marker after many filtered events. The marker has a non-nil but empty EventNotification, so the client routed it to the event path, where it advanced no real watermark yet drove offsetFunc to republish the stale processed watermark — regressing the gauge between heartbeats and spiking the derived lag every time a filtered-event burst landed. Route the empty marker through OnIdleHeartbeat like the idle heartbeat so its fresh timestamp keeps the gauge current; it still advances the in-stream resume cursor.	2026-05-28 23:29:59 -07:00
Chris Lu	5955972fe6	fix(shell): verify volume.merge output before overwriting replicas (#9731 ) * fix(shell): verify volume.merge output before overwriting replicas volume.merge overwrote every replica with the merged copy without checking it was complete. Read back the merged copy and refuse to overwrite unless it holds at least as many live needles as the most complete source replica, leaving the originals intact on a short or empty merge. * fix(shell): keep merged volume until all replicas are rebuilt On a copy failure partway through the overwrite loop, the temporary merged copy was deleted along with the half-rebuilt replicas. Stop deleting it until every replica has been rebuilt; on failure the verified copy is kept so the merge can be re-run to completion. * refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly * fix(shell): guard against nil volume status response	2026-05-28 19:29:25 -07:00
Chris Lu	16717b0bf4	fix(s3): authenticate JWT unsigned-streaming uploads (#9729 ) A bearer-token client whose SDK appends a CRC32 trailer sends an unsigned-streaming PUT (STREAMING-UNSIGNED-PAYLOAD-TRAILER) with no SigV4 signature, so getRequestAuthType classifies it as authTypeStreamingUnsigned. The auth dispatch ignored the bearer token and fell back to anonymous, and newChunkedReader tried to verify the bearer token as a SigV4 seed signature and failed, so the body could not be decoded either. Dispatch the streaming-unsigned auth on whatever credential is present (SigV4 / JWT / anonymous), and skip the SigV4 seed-signature recompute for JWT requests in the chunked reader.	2026-05-28 18:10:24 -07:00
Chris Lu	2f0643e5b1	fix(volume): stop flipping volumes read-only on a non-append-ordered .idx (#9726 ) * fix(volume): verify the .dat-tail needle in the integrity check CheckVolumeDataIntegrity checked the last entry by file position in the .idx and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset. That entry is the .dat tail only when the .idx is in append order; a key-sorted .idx (weed fix and other rebuilds listed entries by key) puts the highest-key needle last, whose tail sits mid-file, so healthy volumes went read-only on every load and re-running weed fix only reproduced the sorted index. Locate the needle at the maximum offset — the one physically last in the .dat — and verify the .dat ends exactly at it, regardless of .idx ordering. The append-ordered common case stays O(1) (the last entry's on-disk end matches the .dat size); only a key-sorted index pays a single linear scan. Deletion tombstones at the tail are now verified too, instead of skipping the file-size check. * fix(command): weed fix rebuilds the .idx in .dat offset order SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape — so the rebuilt .idx put the highest-key needle last instead of the .dat-tail needle, and dropped tombstones whose live needle was gone. Collect the live and deleted entries, sort by .dat offset, and write them in append order so the .idx stays a faithful log whose last entry is the real .dat tail.	2026-05-28 18:04:31 -07:00
Chris Lu	685571d93f	fix(s3): allow anonymous unsigned-streaming PutObject (#9727 ) Modern botocore attaches a CRC32 trailer to plain PutObject, turning the payload into STREAMING-UNSIGNED-PAYLOAD-TRAILER. An anonymous upload then carries that header but no Authorization, so it was classified as authTypeStreamingUnsigned and sent straight to SigV4 verification, which rejected it as AccessDenied while explicit credentials kept working. Fall back to the anonymous identity when an unsigned-streaming request carries no signature, mirroring the plain anonymous path. The request stays classified as unsigned-streaming so the chunked body is still decoded.	2026-05-28 17:00:41 -07:00
Chris Lu	f5b833ab6a	test(ec): end-to-end encode over a multi-server multi-disk stuck layout (#9728 ) * test(framework): support multiple disks per server in MultiVolumeCluster StartMultiVolumeClusterWithDisks gives each volume server N data directories (one DiskLocation each), passed to -dir as a comma list, with a per-server disk-dir accessor for file inspection. StartMultiVolumeCluster keeps its one-disk default. * test(ec): end-to-end encode over a multi-server multi-disk stuck layout A volume in the stuck state — real .dat source, a 0-byte stub replica, and partial stale EC shards from an interrupted encode — must converge to one valid EC layout. Asserts the full shard set across servers, .ecx/.vif kept per server (info file survives the source-volume delete), stale shards cleared, and no regular .dat/.idx left behind.	2026-05-28 16:44:42 -07:00
Chris Lu	3674f9d04d	fix(storage): keep EC .vif when deleting a coexisting regular volume (#9723 ) * fix(storage): keep EC .vif when deleting a coexisting regular volume A regular volume and an EC volume for the same id share <base>.vif. When EC shards are distributed onto a server that still holds the regular volume — the encode source, or any replica the planner targets — the post-encode VolumeDelete ran removeVolumeFiles and stripped the shared .vif, leaving the freshly built EC volume without its info file. Skip the .vif in removeVolumeFiles when an EC volume for the same id exists on the disk (mounted, or a sealed .ecx on disk). The regular volume's .dat/.idx still go; the EC sidecars survive. A two-server end-to-end test encodes a volume whose source and a stub replica both also receive shards, and asserts the final on-disk layout: both .dat/.idx gone, each server holding only its assigned shards plus .ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the Rust seaweed-volume port carries the same guard and tests. * test(storage): assert .idx is removed in the no-EC destroy case Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular volume cleanup (.dat, .idx, .vif) when no EC volume coexists.	2026-05-28 15:39:31 -07:00
Chris Lu	dfd05d14cb	refactor(filer): remove the inode->path index and the NFS gateway (#9724 ) * fix(filer): derive inodes by hash instead of a snowflake sequencer Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers. * chore: remove the experimental NFS gateway The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies. * refactor(filer): drop the inode->path index With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.	2026-05-28 15:00:18 -07:00
Konstantin Lebedev	3537312045	[docker] add make test_keycloak_s3 for local develop and debug (#9719 ) * add make test_keylock_s3 for local develop and debug * fix typos * add condition oidc:azp * docker: reuse test/s3/iam realm and iam config for keycloak dev compose Point the keycloak dev compose at the existing test/s3/iam configs instead of a parallel realm/port/key/role set. Adds one declarative realm import (seaweedfs-test-realm.json) as the single realm source and drops the duplicated iam.json/s3.json. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-28 13:39:32 -07:00

14047 Commits