seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-17 23:31:31 +00:00

Author	SHA1	Message	Date
Chris Lu	3ff92f797d	4.21	2026-04-19 14:38:29 -07:00
Chris Lu	45578a42e9	fix(volume): keep vacuum running past dangling .idx entries (#9115 ) * fix(volume): keep vacuum running past dangling .idx entries Vacuum compaction aborted entirely on the first .idx entry whose offset pointed past the end of the .dat file, surfacing as `cannot hydrate needle from file: EOF` and stalling progress on every other volume. In both Go and Rust: - During compaction, skip an unreadable needle and continue. The bytes it pointed at were already unreachable via reads, so dropping the index reference makes the post-vacuum volume consistent. Real EIO still bails out so a disk fault is not silently papered over. - At volume load, do a single linear scan of the .idx and confirm every (offset + actual size) fits inside .dat. The pre-existing integrity check only looked at the last 10 entries, so deeper corruption (e.g. left over from a crashed batched write) went undetected and only surfaced later as a vacuum EOF. A failure now marks the volume read-only at load time so an operator can react. Refs #8928 * fix(volume): only skip permanent-corruption needle reads during vacuum Address PR review feedback (gemini-code-assist + coderabbit): The original patch skipped any non-EIO read failure, which would silently drop needles on transient errors — Windows hardware bad-sector errors (ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network timeouts and EROFS would also slip through and shrink the volume. Switch to an explicit whitelist of permanent-corruption shapes: - Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of range" errors with %w so callers can match via errors.Is. - copyDataBasedOnIndexFile now skips only when the read failure is io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid, or ErrorCorrupted. Anything else (real disk faults, environmental errors, Windows hardware codes) aborts the compaction so an operator notices. 
- Mirror the same whitelist in the Rust volume server, matching on io::ErrorKind::UnexpectedEof and the NeedleError corruption variants (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort). Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows t.TempDir() cleanup can release the .dat/.idx handles. Refs #8928 * fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch Address PR review: the fallback branch in ReadBytes returned an unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap with %w so the whitelist applies, while leaving the existing direct sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry still triggers. Refs #8928 * fix(volume): integrate dangling-idx check into existing index load walk Address PR review (gemini-code-assist, medium): the structural .idx check used to do a second linear scan of the index file at every volume load, doubling the disk-I/O cost on servers managing many volumes. Track the largest (offset + actual size) seen during the existing needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`, `NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`, `DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`, exposed as `MaxNeedleEnd()` on the NeedleMapper interface. `volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size after the load is complete — pure numeric comparison, no extra I/O. The standalone `verifyIndexFitsInDat` helper and its caller in `CheckVolumeDataIntegrity` are removed; the test that used to drive the helper directly now exercises the new path via `LoadCompactNeedleMap`. Mirror the same change in the Rust volume server: track `max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()` on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum. 
The Rust load walk already happens in `load_from_idx` for both map kinds, so the structural check becomes free. Refs #8928	2026-04-16 22:01:34 -07:00
Chris Lu	9d15705c16	fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112 ) * fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C Interrupts fired grace hooks in registration order, so master (started first) shut down before its clients, producing heartbeat-canceled errors and masterClient reconnection noise during weed mini shutdown. Admin/s3/ webdav had no interrupt hooks at all and were killed at os.Exit. - grace: execute interrupt hooks in LIFO (defer-style) order so later- started services tear down first. - filer: consolidate the three separate interrupt hooks (gRPC / HTTP / DB) into one that runs in order, so filer shutdown stays correct independent of FIFO/LIFO semantics. - mini: add MiniClientsShutdownCtx (separate from test-facing MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3, WebDAV and the maintenance worker observe it; runMini registers a cancel hook after startup so under LIFO it fires first and waits up to 10s on a WaitGroup for those services to drain before filer, volume, and master shut down. Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP -> DB) -> volume -> master. * refactor(mini): group mini-client shutdown into one state struct The first pass spread the shutdown plumbing across three globals (MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker). Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini invocation, and chain its ctx from MiniClusterCtx so clients only observe one signal. Tests that cancel MiniClusterCtx still trigger client shutdown via parent-child propagation. - resetMiniClients() installs fresh state at the top of runMini, so in-process test reruns don't inherit stale ctx/wg. - onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown and only observes one ctx. 
- trackMiniClient() replaces the manual wg.Add/Done dance for the admin goroutine. - miniClientsCtx() gives the admin startup a ctx without re-deriving. - triggerMiniClientsShutdown(timeout) is the interrupt hook body. No behaviour change; existing tests pass. * refactor: generalize shutdown ctx as an option, not a mini-specific helper Several service files (s3, webdav, filer, master, volume) observed the mini-specific MiniClusterCtx or called onMiniClientsShutdown directly. That leaked mini orchestration into code that also runs under weed s3, weed webdav, weed filer, weed master, and weed volume standalone. Replace with a generic `shutdownCtx context.Context` field on each service's Options struct. When non-nil, the server watches it and shuts down gracefully; when nil (standalone), the shutdown path is a no-op. Mini wires the contexts up from a single place (runMini): - miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx = MiniClusterCtx (drives test-triggered teardown) - miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives Ctrl+C teardown before filer/volume/master) All knowledge of MiniClusterCtx now lives in mini.go. * fix(mini): stop worker before clients ctx so admin shutdown isn't blocked Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/ webdav hook sat for 10s then logged "timed out". Admin had started its shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop, waiting for the still-connected worker stream. That in turn left filer clients connected and cascaded into filer's own 10s gRPC graceful-stop timeout. Two causes, both fixed: 1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which makes managerLoop `break out` and exit), then called getTaskLoad() which sends to the same unbuffered cmd channel — no receiver, hangs forever. Reorder Stop() to snapshot the admin client and drain tasks BEFORE sending ActionStop, and call Disconnect() via the local snapshot afterwards. 2. 
Worker's taskRequestLoop raced with Disconnect(): RequestTask reads from c.incoming, which Disconnect closes, yielding a nil response and a panic on response.Message. Handle the closed channel explicitly. 3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs synchronously BEFORE the clients ctx is cancelled. Register worker shutdown there so admin's worker-gRPC GracefulStop finds the worker already disconnected and returns immediately, instead of waiting on a stream that is about to close anyway. Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full process exit in ~11s (the remaining 10s is a pre-existing filer gRPC graceful-stop timeout, not cascaded from the clients tier). * feat(mini): cap filer gRPC graceful stop at 1s under weed mini Full weed mini shutdown was ~11s on a clean exit, dominated by the filer's default 10s gRPC GracefulStop timeout while background SubscribeLocalMetadata streams drained. Expose the timeout as a FilerOptions.gracefulStopTimeout field (default 10s for standalone weed filer) and set it to 1s in mini. Clean weed mini shutdown now takes ~2s.	2026-04-16 16:11:01 -07:00
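Note on 9d15705c16: the LIFO hook ordering described above can be illustrated with a small sketch. The `graceHooks` type and `shutdownOrder` helper are hypothetical names for this example and do not reflect the actual grace package API. The sketch only shows why running hooks in reverse registration order tears down later-started services first.

```go
package main

import "fmt"

// graceHooks is a hypothetical registry of interrupt hooks.
type graceHooks struct {
	hooks []func()
}

// OnInterrupt registers a hook; registration order follows startup order.
func (g *graceHooks) OnInterrupt(fn func()) {
	g.hooks = append(g.hooks, fn)
}

// fire runs the hooks in LIFO (defer-style) order.
func (g *graceHooks) fire() {
	for i := len(g.hooks) - 1; i >= 0; i-- {
		g.hooks[i]()
	}
}

// shutdownOrder registers one hook per service in start order and returns
// the order in which the services are stopped.
func shutdownOrder(startOrder []string) []string {
	var g graceHooks
	var stopped []string
	for _, name := range startOrder {
		name := name // capture a per-iteration copy for the closure
		g.OnInterrupt(func() { stopped = append(stopped, name) })
	}
	g.fire()
	return stopped
}

func main() {
	fmt.Println(shutdownOrder([]string{"master", "volume", "filer", "clients"}))
	// prints "[clients filer volume master]"
}
```

With FIFO execution the same registration would stop master first, which is the heartbeat-canceled noise the commit message describes.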
Chris Lu	886d50a6a5	feat(mount): singleflight dedup for concurrent chunk reads (#9100) * feat(mount): add singleflight deduplication for concurrent chunk reads When multiple FUSE readers request the same uncached chunk concurrently, only one network fetch is performed. Other readers wait and share the downloaded data, reducing redundant volume server traffic under parallel read workloads. * fix(util): make singleflight panic-safe with defer cleanup If the provided function panics, the WaitGroup and map entry are now cleaned up via defer, preventing other waiters from hanging forever. * fix(filer): remove singleflight from reader_cache to fix buffer ownership The singleflight wrapper around chunk fetches returned the same []byte buffer to concurrent callers. Since each SingleChunkCacher owns and frees its data buffer in destroy(), sharing the same slice would cause a use-after-free or double-free with the mem allocator. The downloaders map already deduplicates in-flight downloads for the same fileId, so the singleflight was redundant at this layer. The SingleFlightGroup utility is retained for use elsewhere.	2026-04-16 10:18:05 -07:00
Chris Lu	e1fa4ec756	perf(cache): drop OS page cache after disk cache reads (#9098) * perf(cache): drop OS page cache after disk cache reads After reading from the on-disk chunk cache, advise the kernel via FADV_DONTNEED to release the corresponding page cache pages. This prevents double-caching the same data in both user-space and kernel page caches, freeing RAM for other uses on systems with large disk caches. * fix(cache): guard dropReadCache against zero length and invalid fd A zero-length fadvise is interpreted as "to end of file" on Linux, which would inadvertently drop the page cache for the entire remainder of the cache volume. Also check fd >= 0 to avoid unnecessary syscalls when the backend file is closed. * perf(cache): only apply FADV_DONTNEED for reads >= 1 MiB For small needle reads the syscall overhead outweighs the memory savings, and the kernel page cache is more beneficial for warm data. Restrict fadvise to reads of at least 1 MiB where the freed page cache is meaningful.	2026-04-16 09:38:42 -07:00
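Note on e1fa4ec756: the three guards described above (valid fd, non-zero length, at least 1 MiB) reduce to a small predicate. The sketch below covers only that decision and does not issue the fadvise call. The names `shouldDropReadCache` and `minDropSize` are hypothetical and do not come from the commit.

```go
package main

import "fmt"

// minDropSize is the 1 MiB threshold mentioned in the commit message.
const minDropSize = 1 << 20

// shouldDropReadCache reports whether a FADV_DONTNEED hint is worthwhile
// for a read of the given length on the given file descriptor.
func shouldDropReadCache(fd int, length int64) bool {
	if fd < 0 {
		return false // backend file is closed; skip the syscall
	}
	if length <= 0 {
		// A zero length means "to end of file" for fadvise on Linux,
		// which would drop far more than the range that was just read.
		return false
	}
	return length >= minDropSize
}

func main() {
	fmt.Println(shouldDropReadCache(3, 4<<20))  // true
	fmt.Println(shouldDropReadCache(3, 0))      // false
	fmt.Println(shouldDropReadCache(3, 64<<10)) // false
	fmt.Println(shouldDropReadCache(-1, 4<<20)) // false
}
```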
Chris Lu	08d9193fe1	[nfs] Add NFS (#9067 ) * add filer inode foundation for nfs * nfs command skeleton * add filer inode index foundation for nfs * make nfs inode index hardlink aware * add nfs filehandle and inode lookup plumbing * add read-only nfs frontend foundation * add nfs namespace mutation support * add chunk-backed nfs write path * add nfs protocol integration tests * add stale handle nfs coverage * complete nfs hardlink and failover coverage * add nfs export access controls * add nfs metadata cache invalidation * fix nfs chunk read lookup routing * fix nfs review findings and rename regression * address pr 9067 review comments - filer_inode: fail fast if the snowflake sequencer cannot start, and let operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID to avoid multi-filer collisions - filer_inode: drop the redundant retry loop in nextInode - filerstore_wrapper: treat inode-index writes/removals as best-effort so a primary store success no longer surfaces as an operation failure - filer_grpc_server_rename: defer overwritten-target chunk deletion until after CommitTransaction so a rolled-back rename does not strand live metadata pointing at freshly deleted chunks - command/nfs: default ip.bind to loopback and require an explicit filer.path, so the experimental server does not expose the entire filer namespace on first run - nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire layout rather than RFC 1813 LINK3args * mount: pre-allocate inode in Mkdir and Symlink Mkdir and Symlink used to send filer_pb.CreateEntryRequest with Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns its own inode in that case, so the filer-side entry ends up with a different inode than the one the mount allocates via inodeToPath.Lookup and returns to the kernel. 
Once applyLocalMetadataEvent stores the filer's entry in the meta cache, subsequent GetAttr calls read the cached entry and hit the setAttrByPbEntry override at line 197 of weedfs_attr.go, returning the filer-assigned inode instead of the mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91) caught this — it lstat'd a freshly-created directory/symlink, renamed it, lstat'd again, and saw a different inode the second time. createRegularFile already pre-allocates via inodeToPath.AllocateInode and stamps it into the create request. Do the same thing in Mkdir and Symlink so both sides agree on the object identity from the very first request, and so GetAttr's cache path returns the same value as Mkdir / Symlink's initial response. * sequence: mask snowflake node id on int→uint32 conversion CodeQL flagged the unchecked uint32(snowflakeId) cast in NewSnowflakeSequencer as a potential truncation bug when snowflakeId is sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask to the 10 bits the snowflake library actually uses so any caller- supplied int is safely clamped into range. * add test/nfs integration suite Boots a real SeaweedFS cluster (master + volume + filer) plus the experimental `weed nfs` frontend as subprocesses and drives it through the NFSv3 wire protocol via go-nfs-client, mirroring the layout of test/sftp. The tests run without a kernel NFS mount, privileged ports, or any platform-specific tooling. Coverage includes read/write round-trip, mkdir/rmdir, nested directories, rename content preservation, overwrite + explicit truncate, 3 MiB binary file, all-byte binary and empty files, symlink round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity, sequential appends, and readdir-after-remove. Framework notes: - Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes -port.grpc explicitly so the default port+10000 convention cannot overflow uint16 on macOS. 
- Pre-creates the /nfs_export directory via the filer HTTP API before starting the NFS server — the NFS server's ensureIndexedEntry check requires the export root to exist with a real entry, which filer.Root does not satisfy when the export path is "/". - Reuses the same rpc.Client for mount and target so go-nfs-client does not try to re-dial via portmapper (which concatenates ":111" onto the address). * ci: add NFS integration test workflow Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch the NFS server, the inode filer plumbing it depends on, or the test harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu 22.04 via `make test`. * nfs: use append for buffer growth in Write and Truncate The previous make+copy pattern reallocated the full buffer on every extending write or truncate, giving O(N^2) behaviour for sequential write loops. Switching to `append(f.content, make([]byte, delta)...)` lets Go's amortized growth strategy absorb the repeated extensions. Called out by gemini-code-assist on PR 9067. * filer: honor caller cancellation in collectInodeIndexEntries Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of the inode-index scan if the client disconnects mid-walk. The cleanup is already treated as best-effort by the caller (it logs on error and continues), so a cancelled walk just means the partial index rebuild is skipped — the same failure mode as any other index write error. Flagged as a DoS concern by gemini-code-assist on PR 9067. * nfs: skip filer read on open when O_TRUNC is set openFile used to unconditionally loadWritableContent for every writable open and then discard the buffer if O_TRUNC was set. For large files that is a pointless 64 MiB round-trip. Reorder the branches so we only fetch existing content when the caller intends to keep it, and mark the file dirty right away so the subsequent Close still issues the truncating write. Called out by gemini-code-assist on PR 9067. 
* nfs: allow Seek on O_APPEND files and document buffered write cap Two related cleanups on filesystem.go: - POSIX only restricts Write on an O_APPEND fd, not lseek. The existing Seek error ("append-only file descriptors may only seek to EOF") prevented read-and-write workloads that legitimately reposition the read cursor. Write already snaps the offset to EOF before persisting (see seaweedFile Write), so Seek can unconditionally accept any offset. Update the unit test that was asserting the old behaviour. - Add a doc comment on maxBufferedWriteSize explaining that it is a per-file ceiling, the memory footprint it implies, and that the real fix for larger whole-file rewrites is streaming / multi-chunk support. Both changes flagged by gemini-code-assist on PR 9067. * nfs: guard offset before casting to int in Write CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as a potential overflow on architectures where `int` is 32-bit. The existing check only bounded the post-cast value, which is too late. Clamp f.offset against maxBufferedWriteSize before the cast and also reject negative/overflowed endOffset results. Both branches fall through to billy.ErrNotSupported, the same behaviour the caller gets today for any out-of-range buffered write. * nfs: compute Write endOffset in int64 to satisfy CodeQL The previous guard bounded f.offset but left len(p) unchecked, so CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width overflow path. Bound len(p) against maxBufferedWriteSize first, do the addition in int64, and only cast down after the total has been clamped against the buffer ceiling. Behaviour is unchanged: any out-of-range write still returns billy.ErrNotSupported. * ci: drop emojis from nfs-tests workflow summary Plain-text step summary per user preference — no decorative glyphs in the NFS CI output or checklist. 
* nfs: annotate remaining DEV_PLAN TODOs with status Three of the unchecked items are genuine follow-up PRs rather than missing work in this one, and one was actually already done: - Reuse chunk cache and mutation stream helpers without FUSE deps: checked off — the NFS server imports weed/filer.ReaderCache and weed/util/chunk_cache directly with no weed/mount or go-fuse imports. - Extract shared read/write helpers from mount/WebDAV/SFTP: annotated as deferred to a separate refactor PR (touches four packages). - Expand direct data-path writes beyond the 64 MiB buffered fallback: annotated as deferred — requires a streaming WRITE path. - Shared lock state + lock tests: annotated as blocked upstream on go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing "Current Blockers" note. * test/nfs: share port+readiness helpers with test/testutil Drop the per-suite mustPickFreePort and waitForService re-implementations in favor of testutil.MustAllocatePorts (atomic batch allocation; no close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout. Pull testutil in via a local replace directive so this standalone seaweedfs-nfs-tests module can import the in-repo package without a separate release. Subprocess startup is still master + volume + filer + nfs — no switch to weed mini yet, since mini does not know about the nfs frontend. * nfs: stream writes to volume servers instead of buffering the whole file Before this change the NFS write path held the full contents of every writable open in memory: - OpenFile(write) called loadWritableContent which read the existing file into seaweedFile.content up to maxBufferedWriteSize (64 MiB) - each Write() extended content in-place - Close() uploaded the whole buffer as a single chunk via persistContent + AssignVolume The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and even below the cap every Write paid a whole-file-in-memory cost. 
This PR rewrites the write path to match how `weed filer` and the S3 gateway persist data: - openFile(write) no longer loads the existing content at all; it only issues an UpdateEntry when O_TRUNC is set and the file is non-empty (so a fresh create+trunc is still zero-RPC) - Write() streams the caller's bytes straight to a volume server via one AssignVolume + one chunk upload, then atomically appends the resulting chunk to the filer entry through mutateEntry. Any previously inlined entry.Content is migrated to a chunk in the same update so the chunk list becomes the authoritative representation. - Truncate() becomes a direct mutateEntry (drop chunks past the new size, clip inline content, update FileSize) instead of resizing an in-memory buffer. - Close() is a no-op because everything was flushed inline. The small-file fast path that the filer HTTP handler uses is preserved: if the post-write size still fits in maxInlineWriteSize (4 MiB) and the file has no existing chunks, we rewrite entry.Content directly and skip the volume-server round-trip. This keeps single-shot tiny writes (echo, small edits) cheap while completely removing the 64 MiB cap on larger files. Read() now always reads through the chunk reader instead of a local byte slice, so reads inside the same session see the freshly appended data. Drops the unused seaweedFile.content / dirty fields, the maxBufferedWriteSize constant, and the loadWritableContent helper. Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations to match the new "no extra O_TRUNC UpdateEntry on an empty file" behavior (still 3 updates: Write + Chmod + Truncate). 
* filer: extract shared gateway upload helper for NFS and WebDAV Three filer-backed gateways (NFS, WebDAV, and mount) each had a local saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry with near-identical bodies: build AssignVolumeRequest, build UploadOption, build genFileUrlFn with optional filerProxy rewriting, call UploadWithRetry, validate the result, and call ToPbFileChunk. Pull that body into filer.SaveGatewayDataAsChunk with a GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate to one implementation. - NFS's saveDataAsChunk is now a thin adapter that assembles the GatewayChunkUploadRequest from server options and calls the helper. The chunkUploader interface keeps working for test injection because the new GatewayChunkUploader interface is structurally identical. - WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the local operation.NewUploader call plus the AssignVolume/UploadOption scaffolding. - mount is intentionally left alone. mount's saveDataAsChunk has two features that do not fit the shared helper (a pre-allocated file-id pool used to skip AssignVolume entirely, and a chunkCache write-through at offset 0 so future reads hit the mount's local cache), both of which are mount-specific. Marks the Phase 2 "extract shared read/write helpers from mount, WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals + NewChunkReaderAtFromClient) was already shared. * nfs: remove DESIGN.md and DEV_PLAN.md The planning documents have served their purpose — all phase 1 and phase 2 items are landed, phase 3 streaming writes are landed, phase 2 shared helpers are extracted, and the two remaining phase 4 items (shared lock state + lock tests) are blocked upstream on github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state RPCs. The running decision log no longer reflects current code and would just drift. 
The NFS wiki page (https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries the overview, configuration surface, architecture notes, and known limitations; the source is the source of truth for the rest.	2026-04-14 20:48:24 -07:00
Chris Lu	50f25bb5cd	4.20	2026-04-13 13:25:13 -07:00
Lars Lehtonen	80db692728	fix(weed/util/chunk_cache): fix dropped errors (#9042)	2026-04-13 01:16:56 -07:00
Chris Lu	edf7d2a074	fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039 ) * fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035) Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to readPersistedLogBufferPosition, causing LoopProcessLogData to call ReadPersistedLogBuffer on every 250ms health-check tick when a subscriber encounters ResumeFromDiskError. Each call creates an OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a readahead goroutine with a 1024-element channel, finds no data, and returns — 4 times per second even on an idle filer. This is redundant because SubscribeLocalMetadata already manages disk reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs tracking in the outer loop. Set ReadFromDiskFn back to nil for LocalMetaLogBuffer. When LoopProcessLogData encounters ResumeFromDiskError with nil ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the caller (SubscribeLocalMetadata), which blocks efficiently on listenersCond.Wait() instead of polling. * fix(filer): add gap detection for slow consumers after disk-read stall When a slow consumer falls behind and LoopProcessLogData returns ResumeFromDiskError with no flush or read-position progress, there may be a gap between persisted data and in-memory data (e.g. writes stopped while consumer was still catching up). Without this, the consumer would block on listenersCond.Wait() forever. Skip forward to the earliest in-memory time to resume progress, matching the gap-handling pattern already used in the shouldReadFromDisk path. * fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall The gap-detection block added in the previous commit skips lastReadTime forward to GetEarliestTime() and continues the outer loop. 
On the next iteration, shouldReadFromDisk becomes true (currentReadTsNs > lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the existing gap handler at the top of the loop runs its own gap check. That check uses readInMemoryLogErr == ResumeFromDiskError as the entry condition — but readInMemoryLogErr is still the stale error from two iterations ago. GetEarliestTime() now equals lastReadTime.Time (we already advanced to it), so earliestTime.After(lastReadTime.Time) is false and the handler falls into listenersCond.Wait() — stuck. Clear readInMemoryLogErr at the gap-skip point, matching the existing pattern at the earlier gap handler that already clears it for the same reason. * fix(log_buffer): GetEarliestTime must include sealed prev buffers GetEarliestTime previously returned only logBuffer.startTime (the active buffer's first timestamp). That is narrower than ReadFromBuffer's tsMemory, which is the min across active + prev buffers. Callers using GetEarliestTime for gap detection after ResumeFromDiskError (the SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time that was newer than the real earliest in-memory data. 
Impact in SubscribeLocalMetadata's slow-consumer path: - tsMemory = earliest prev buffer time (T_prev) - GetEarliestTime() = active startTime (T_active, later than T_prev) - Consumer position = T1, with T_prev < T1 < T_active - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory) - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true - Skip forward to T_active -- silently drops the prev-buffer data - And when T_active happens to equal the stuck position, gap detect evaluates false, and the subscriber stalls on listenersCond.Wait() This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing failure in CI where the consumer stalled at 10220/20000 after writing stopped -- the buffer still had data in prev[0..3], but gap detection was comparing against the active buffer's startTime. Fix: scan all sealed prev buffers under RLock, return the true minimum startTime. Matches the min-of-buffers logic in ReadFromBuffer. * test(log_buffer): make DiskReadRetry test deterministic The previous test added the message via AddToBuffer + ForceFlush and relied on a race: the second disk read had to happen before the data was delivered through the in-memory path. Under the race detector or on a slow CI runner, the reader is woken by AddToBuffer's notification, finds the data in the active buffer or its prev slot, and returns after exactly one disk read — failing the >= 2 disk reads assertion even though the loop behaved correctly. Reproduced on master with race detector (2/5 failures). Rewrite the test to deliver the data exclusively through the disk-read path: no AddToBuffer, no ForceFlush. The test waits until the reader has issued at least one no-op disk read, then atomically flips a "dataReady" flag. The reader's next iteration through readFromDiskFn returns the entry. This deterministically exercises the retry-loop behavior the test was originally written to protect, and removes the in-memory delivery race entirely.	2026-04-11 23:12:54 -07:00
Chris Lu	e648c76bcf	go fmt	2026-04-10 17:31:14 -07:00
Chris Lu	eb5624233d	[filer] fix log buffer idle polling (#9012 ) * fix log buffer idle polling * log_buffer: document notificationHealthCheckInterval tradeoffs Explain that notifyChan is the primary wakeup path and this interval only bounds the fallback / state-recheck cadence, so future maintainers don't tune it without understanding the implications for client-disconnect detection latency. * log_buffer: rename waitForNotification to awaitNotificationOrTimeout The helper returns after either a notification or the health-check timeout; the old name read like it blocked indefinitely. No behavior change. * log_buffer: wake blocked subscribers on shutdown awaitNotificationOrTimeout previously only returned on notifyChan or the health-check timeout, so ShutdownLogBuffer on an idle buffer (where copyToFlush returns nil and loopFlush never fires the post-flush notification) would leave subscribers parked for up to 250ms before they noticed IsStopping. Add an internal shutdownCh closed by ShutdownLogBuffer and select on it from awaitNotificationOrTimeout, which is now a method on LogBuffer. Subscribers wake immediately, re-check IsStopping, and exit. No change to LoopProcessLogData signatures or any caller (filer metadata subscribers, MQ broker, local partition subscribe). log_buffer: regression tests for flush-notify wake-up TestLoopFlush_NotifiesSubscribersAfterFlush directly verifies that loopFlush calls notifySubscribers after processing a flush, so a reader parked on notifyChan wakes promptly when a flush lands. Verified to fail if that notification is removed. TestLoopProcessLogDataWithOffset_WakesOnDataArrival is the end-to-end counterpart: a real LoopProcessLogDataWithOffset reader parks on notifyChan via the ResumeFromDiskError branch, then wakes and processes the entry well under the 250ms fallback once data arrives. * log_buffer: keep notification-timeout logs at V(4) Revert the V(4)->V(5) demotion. 
Now that the shutdown wake-up path exists and (with the follow-up fix) idle-polling CPU churn is bounded by the 250ms health check, these timeout logs no longer flood at V=4 the way they did on the 10ms fallback, so the previous verbosity is appropriate again. * log_buffer: exit reader loops cleanly on shutdown awaitNotificationOrTimeout returns true on both data notifications and shutdown (shutdownCh closed). Without an explicit IsStopping() guard, the ResumeFromDiskError, offset-based no-data, empty-buffer, and timestamp-wait paths would either tight-spin against a closed shutdownCh or, in the offset-based case, return ResumeFromDiskError to the caller instead of exiting. Add an IsStopping() check after each awaitNotificationOrTimeout call that previously continued or returned ResumeFromDiskError, so subscribers exit promptly with isDone=true and err=nil when ShutdownLogBuffer is called. * log_buffer: regression test for shutdown wake-up Park a real LoopProcessLogDataWithOffset reader on notifyChan via the ResumeFromDiskError branch, call ShutdownLogBuffer, and assert the reader exits with isDone=true and err=nil well under the 250ms fallback. Verified to fail (timeout) if the IsStopping() guards added in the prior commit are removed. * log_buffer: bump reader-park sleep to 50ms with rationale Both wake-path tests use a sleep to give the goroutine time to reach awaitNotificationOrTimeout before the test triggers the wake-up. Bump from 20ms to 50ms and document the timing assumption to reduce flakiness on slow CI. Both paths are race-free either way (a buffered notification or a closed shutdownCh stays valid until consumed), so this is purely about exercising the park-then-wake path rather than the already-pending fast path.	2026-04-09 18:09:57 -07:00
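The shutdown wake-up path in the entry above can be sketched as a three-way select followed by a stopping check. Only the names `awaitNotificationOrTimeout`, `notifyChan`, `shutdownCh`, and the 250ms interval come from the commit message; the `buffer` type, `newBuffer`, `shutdown`, and `isStopping` are hypothetical simplifications, and the real implementation may differ.

```go
package main

import (
	"sync"
	"time"
)

// healthCheckInterval mirrors the 250ms fallback mentioned in the commit message.
const healthCheckInterval = 250 * time.Millisecond

// buffer is a hypothetical stand-in for LogBuffer.
type buffer struct {
	notifyChan   chan struct{} // buffered, so a pending notification is not lost
	shutdownCh   chan struct{} // closed once on shutdown
	shutdownOnce sync.Once
}

func newBuffer() *buffer {
	return &buffer{
		notifyChan: make(chan struct{}, 1),
		shutdownCh: make(chan struct{}),
	}
}

// shutdown closes shutdownCh exactly once; every parked reader wakes up.
func (b *buffer) shutdown() {
	b.shutdownOnce.Do(func() { close(b.shutdownCh) })
}

// isStopping reports whether shutdown has been requested.
func (b *buffer) isStopping() bool {
	select {
	case <-b.shutdownCh:
		return true
	default:
		return false
	}
}

// awaitNotificationOrTimeout returns true when woken by a notification or by
// shutdown, and false when the health-check interval elapses first.
func (b *buffer) awaitNotificationOrTimeout() bool {
	timer := time.NewTimer(healthCheckInterval)
	defer timer.Stop()
	select {
	case <-b.notifyChan:
		return true
	case <-b.shutdownCh:
		return true
	case <-timer.C:
		return false
	}
}
```

Because a closed channel is always ready to receive, a reader loop has to call `isStopping()` after each wake-up and exit; otherwise it would spin against the closed `shutdownCh`, which is the tight-spin case the commit guards against.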
eason	a04c9c7dde	fix: close CPU profile file after stopping profiling (#9000) The file handle from os.Create(cpuProfile) was passed to pprof.StartCPUProfile but never closed in the OnInterrupt handler. The block and mutex profile files are correctly closed, but the main CPU profile file was leaked. Add f.Close() after pprof.StopCPUProfile() to prevent the file descriptor leak. Co-authored-by: easonysliu <easonysliu@tencent.com>	2026-04-08 22:13:02 -07:00
Chris Lu	0bdf9b0683	4.19	2026-04-07 19:21:35 -07:00
Chris Lu	2919bb27e5	fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8974) * fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8965) When filer.sync runs with -a.security and -b.security flags, only gRPC connections received per-cluster TLS configuration. HTTP clients for volume server reads and uploads used a global singleton with the default security.toml, causing TLS verification failures when clusters use different self-signed certificates. Load per-cluster HTTPS client config from the security files and pass dedicated HTTP clients to FilerSource (for downloads) and FilerSink (for uploads) so each direction uses the correct cluster's certificates. * fix(sync): address review feedback for per-cluster HTTP TLS - Add insecure_skip_verify support to NewHttpClientWithTLS and read it from per-cluster security config via https.client.insecure_skip_verify - Error on partial mTLS config (cert without key or vice versa) - Add nil-check for client parameter in DownloadFileWithClient - Document SetUploader as init-only (same pattern as SetChunkConcurrency)	2026-04-07 14:11:44 -07:00
Chris Lu	f6df7126b6	feat(admin): add profiling options for debugging high memory/CPU usage (#8923) * feat(admin): add profiling options for debugging high memory/CPU usage Add -debug, -debug.port, -cpuprofile, and -memprofile flags to the admin command, matching the profiling support already available in master, volume, and other server commands. This enables investigation of resource usage issues like #8919. * refactor(admin): move profiling flags into AdminOptions struct Move cpuprofile and memprofile flags from global variables into the AdminOptions struct and init() function for consistency with other flags. * fix(debug): bind pprof server to localhost only and document profiling flags StartDebugServer was binding to all interfaces (0.0.0.0), exposing runtime profiling data to the network. Restrict to 127.0.0.1 since this is a development/debugging tool. Also add a "Debugging and Profiling" section to the admin command's help text documenting the new flags.	2026-04-04 10:05:19 -07:00
Chris Lu	0798b274dd	feat(s3): add concurrent chunk prefetch for large file downloads (#8917 ) * feat(s3): add concurrent chunk prefetch for large file downloads Add a pipe-based prefetch pipeline that overlaps chunk fetching with response writing during S3 GetObject, SSE downloads, and filer proxy. While chunk N streams to the HTTP response, fetch goroutines for the next K chunks establish HTTP connections to volume servers ahead of time, eliminating the RTT gap between sequential chunk fetches. Uses io.Pipe for minimal memory overhead (~1MB per download regardless of chunk size, vs buffering entire chunks). Also increases the streaming read buffer from 64KB to 256KB to reduce syscall overhead. Benchmark results (64KB chunks, prefetch=4): - 0ms latency: 1058 → 2362 MB/s (2.2× faster) - 5ms latency: 11.0 → 41.7 MB/s (3.8× faster) - 10ms latency: 5.9 → 23.3 MB/s (4.0× faster) - 20ms latency: 3.1 → 12.1 MB/s (3.9× faster) * fix: address review feedback for prefetch pipeline - Fix data race: use chunkPipeResult (pointer) on channel to avoid copying struct while fetch goroutines write to it. Confirmed clean with -race detector. - Remove concurrent map write: retryWithCacheInvalidation no longer updates fileId2Url map. Producer only reads it; consumer never writes. - Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure. - Add local cancellable context so consumer errors (client disconnect) immediately stop the producer and all in-flight fetch goroutines. fix(test): remove dead code and add Range header support in test server - Remove unused allData variable in makeChunksAndServer - Add Range header handling to createTestServer for partial chunk read coverage (206 Partial Content, 416 Range Not Satisfiable) * fix: correct retry condition and goroutine leak in prefetch pipeline - Fix retry condition: use result.fetchErr/result.written instead of copied to decide cache-invalidation retry. 
The old condition wrongly triggered retry when the fetch succeeded but the response writer failed on the first write (copied==0 despite fetcher having data). Now matches the sequential path (stream.go:197) which checks whether the fetcher itself wrote zero bytes. - Fix goroutine leak: when the producer's send to the results channel is interrupted by context cancellation, the fetch goroutine was already launched but the result was never sent to the channel. The drain loop couldn't handle it. Now waits on result.done before returning so every fetch goroutine is properly awaited.	2026-04-03 19:57:30 -07:00
Chris Lu	995dfc4d5d	chore: remove ~50k lines of unreachable dead code (#8913 ) * chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. 
The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking.	2026-04-03 16:04:27 -07:00
Chris Lu	6213daf118	4.18	2026-04-01 17:42:41 -07:00
Chris Lu	ced2236cc6	Adjust rename events metadata format (#8854 ) * rename metadata events * fix subscription filter to use NewEntry.Name for rename path matching The server-side subscription filter constructed the new path using OldEntry.Name instead of NewEntry.Name when checking if a rename event's destination matches the subscriber's path prefix. This could cause events to be incorrectly filtered when a rename changes the file name. * fix bucket events to handle rename of bucket directories onBucketEvents only checked IsCreate and IsDelete. A bucket directory rename via AtomicRenameEntry now emits a single rename event (both OldEntry and NewEntry non-nil), which matched neither check. Handle IsRename by deleting the old bucket and creating the new one. * fix replicator to handle rename events across directory boundaries Two issues fixed: 1. The replicator filtered events by checking if the key (old path) was under the source directory. Rename events now use the old path as key, so renames from outside into the watched directory were silently dropped. Now both old and new paths are checked, and cross-boundary renames are converted to create or delete. 2. NewParentPath was passed to the sink without remapping to the sink's target directory structure, causing the sink to write entries at the wrong location. Now NewParentPath is remapped alongside the key. * fix filer sync to handle rename events crossing directory boundaries The early directory-prefix filter only checked resp.Directory (old parent). Rename events now carry the old parent as Directory, so renames from outside the source path into it were dropped before reaching the existing cross-boundary handling logic. Check both old and new directories against sourcePath and excludePaths so the downstream old-key/new-key logic can properly convert these to create or delete operations. 
* fix metadata event path matching * fix metadata event consumers for rename targets * Fix replication rename target keys Logical rename events now reach replication sinks with distinct source and target paths. Handle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too. Add focused tests covering non-filer renames, filer rename updates, and the fallback path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix filer sync rename path scoping Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync. Also apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates. Add focused tests for boundary matching and rename classification. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix replicator directory boundary checks Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree. This prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix. Add focused tests for boundary matching and rename classification across sibling/excluded directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix etc-remote rename-out handling Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote. This prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths. Add focused tests for update/removal classification and sibling-prefix handling. Co-authored-by: Copilot 
<223556219+Copilot@users.noreply.github.com> * Defer rename events until commit Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully. This prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit. Also serialize notification.Queue swaps in rename tests and add failure-path coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Skip descendant rename target lookups Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent. The recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten rename review tests Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix HardLinkId format verb in InsertEntryKnownAbsent error HardLinkId is a byte slice. %d prints each byte as a decimal number which is not useful for an identifier. Use %x to match the log line two lines above. * only skip descendant target lookup when source and dest use same store moveFolderSubEntries unconditionally passed skipTargetLookup=true for every descendant. This is safe when all paths resolve to the same underlying store, but with path-specific store configuration a child's destination may map to a different backend that already holds an entry at that path. Use FilerStoreWrapper.SameActualStore to check per-child and fall back to the full CreateEntry path when stores differ. 
* add nil and create edge-case tests for metadata event scope helpers * extract pathIsEqualOrUnder into util.IsEqualOrUnder Identical implementations existed in both replication/replicator.go and command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the existing FullPath.IsUnder) and remove the duplicates. * use MetadataEventTargetDirectory for new-side directory in filer sync The new-side directory checks and sourceNewKey computation used message.NewParentPath directly. If NewParentPath were empty (legacy events, older filer versions during rolling upgrades), sourceNewKey would be wrong (/filename instead of /dir/filename) and the UpdateEntry parent path rewrite would panic on slice bounds. Derive targetDir once from MetadataEventTargetDirectory, which falls back to resp.Directory when NewParentPath is empty, and use it consistently for all new-side checks and the sink parent path.	2026-03-30 18:25:11 -07:00
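The directory-boundary matching that the entry above consolidates into util.IsEqualOrUnder can be illustrated with a short sketch. The commit message does not show that function's signature, so the lowercase `isEqualOrUnder` below is a hypothetical stand-in with assumed string parameters, not the library function itself.

```go
package main

import "strings"

// isEqualOrUnder reports whether path is dir itself or lies beneath it.
// Matching happens on directory boundaries, so "/foobar" is not treated as
// being under "/foo" the way a raw strings.HasPrefix check would treat it.
func isEqualOrUnder(path, dir string) bool {
	dir = strings.TrimSuffix(dir, "/") // tolerate a trailing slash on dir
	if path == dir {
		return true
	}
	return strings.HasPrefix(path, dir+"/")
}
```

For example, `isEqualOrUnder("/foo/bar", "/foo")` is true while `isEqualOrUnder("/foobar", "/foo")` is false, which is the sibling-path case the rename fixes call out.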
Chris Lu	92c2fc0d52	Add insecure_skip_verify option for HTTPS client in security.toml (#8781) * Add -insecureSkipVerify flag and config option for filer.sync HTTPS connections When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails with "x509: certificate signed by unknown authority". This adds two ways to skip TLS certificate verification: 1. CLI flag: `weed filer.sync -insecureSkipVerify ...` 2. Config option: `insecure_skip_verify = true` under [https.client] in security.toml Closes #8778 * Add insecure_skip_verify option for HTTPS client in security.toml When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails. Adding insecure_skip_verify = true under [https.client] in security.toml allows skipping TLS certificate verification. The option is read during global HTTP client initialization so it applies to all HTTPS connections including filer.sync proxy reads and writes. Closes #8778 --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-26 11:42:47 -07:00
Chris Lu	2877febd73	S3: fix silent PutObject failure and enforce 1024-byte key limit (#8764 ) * S3: add KeyTooLongError error code Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the standard AWS S3 error for object keys that exceed length limits. * S3: fix silent PutObject failure when entry name exceeds max_file_name_length putToFiler called client.CreateEntry() directly and discarded the gRPC response. The filer embeds application errors like "entry name too long" in resp.Error (not as gRPC transport errors), so the error was silently swallowed and clients received HTTP 200 with an ETag for objects that were never stored. Switch to the filer_pb.CreateEntry() helper which properly checks resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400). To avoid fragile string parsing across the gRPC boundary, define shared error message constants in weed/util/constants and use them in both the filer (producing errors) and S3 API (matching errors). Switch filerErrorToS3Error to use strings.Contains/HasSuffix with these constants so matches work regardless of any wrapper prefix. Apply filerErrorToS3Error to the mkdir path for directory markers. Fixes #8759 * S3: enforce 1024-byte maximum object key length AWS S3 limits object keys to 1024 bytes. Add early validation on write paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys exceeding the limit with the standard KeyTooLongError (HTTP 400). The key length check runs before bucket auto-creation to prevent overlong keys from triggering unnecessary side effects. Also use filerErrorToS3Error for CopyObject's mkFile error paths so name-too-long errors from the filer return KeyTooLongError instead of InternalError. Ref #8758 * S3: add handler-level tests for key length validation and error mapping Add tests for filerErrorToS3Error mapping "entry name too long" to KeyTooLongError, including a regression test for the CreateEntry-prefixed "existing ... is a directory" form. 
Add handler-level integration tests that exercise PutObjectHandler, CopyObjectHandler, and NewMultipartUploadHandler via httptest, verifying HTTP 400 and KeyTooLongError XML response for overlong keys and acceptance of keys at the 1024-byte limit.	2026-03-24 13:35:28 -07:00
Chris Lu	6bf654c25c	fix: keep metadata subscriptions progressing (#8730) (#8746) * fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks	2026-03-23 15:26:54 -07:00
Chris Lu	08b79a30f6	Fix lock table shared wait condition (#8707) * Fix lock table shared wait condition (#8696) * Refactor lock table waiter check * Add exclusive lock wait helper test	2026-03-19 16:08:24 -07:00
Chris Lu	b665c329bc	fix(replication): resume partial chunk reads on EOF instead of re-downloading (#8607 ) * fix(replication): resume partial chunk reads on EOF instead of re-downloading When replicating chunks and the source connection drops mid-transfer, accumulate the bytes already received and retry with a Range header to fetch only the remaining bytes. This avoids re-downloading potentially large chunks from scratch on each retry, reducing load on busy source servers and speeding up recovery. * test(replication): add tests for downloadWithRange including gzip partial reads Tests cover: - No offset (no Range header sent) - With offset (Range header verified) - Content-Disposition filename extraction - Partial read + resume: server drops connection mid-transfer, client resumes with Range from the offset of received bytes - Gzip partial read + resume: first response is gzip-encoded (Go auto- decompresses), connection drops, resume request gets decompressed data (Go doesn't add Accept-Encoding when Range is set, so the server decompresses), combined bytes match original * fix(replication): address PR review comments - Consolidate downloadWithRange into DownloadFile with optional offset parameter (variadic), eliminating code duplication (DRY) - Validate HTTP response status: require 206 + correct Content-Range when offset > 0, reject when server ignores Range header - Use if/else for fullData assignment for clarity - Add test for rejected Range (server returns 200 instead of 206) * refactor(replication): remove unused ReplicationSource interface The interface was never referenced and its signature didn't match the actual FilerSource.ReadPart method. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-11 22:38:22 -07:00
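The resume logic in the entry above has two small pieces that are easy to get wrong: building the Range header from the bytes already received, and rejecting a server that ignores it. A minimal sketch, assuming hypothetical helper names `rangeHeader` and `checkResumeResponse`; the commit itself folds this logic into DownloadFile, whose code is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// rangeHeader returns the Range header value for resuming at offset, or an
// empty string when no bytes have been received yet (no header is sent).
func rangeHeader(offset int64) string {
	if offset <= 0 {
		return ""
	}
	return fmt.Sprintf("bytes=%d-", offset)
}

// checkResumeResponse validates the response to a resumed request. When an
// offset was requested, the server must answer 206 with a Content-Range that
// starts at that offset; a plain 200 means the Range header was ignored and
// appending the body to the partial data would corrupt the chunk.
func checkResumeResponse(offset int64, status int, contentRange string) error {
	if offset <= 0 {
		return nil // full download, nothing to validate here
	}
	if status != http.StatusPartialContent {
		return fmt.Errorf("expected 206 for offset %d, got %d", offset, status)
	}
	if !strings.HasPrefix(contentRange, fmt.Sprintf("bytes %d-", offset)) {
		return fmt.Errorf("unexpected Content-Range %q for offset %d", contentRange, offset)
	}
	return nil
}
```

The caller would accumulate the received bytes, retry with `rangeHeader(int64(len(received)))`, and append the new body only when `checkResumeResponse` returns nil.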
Chris Lu	4a5243886a	4.17	2026-03-11 02:29:24 -07:00
Chris Lu	8ad58e7002	4.16	2026-03-09 21:52:43 -07:00
Chris Lu	2ec0a67ee3	master: return 503/Unavailable during topology warmup after leader change (#8529 ) * master: return 503/Unavailable during topology warmup after leader change After a master restart or leader change, the topology is empty until volume servers reconnect and send heartbeats. During this warmup window (3 heartbeat intervals = 15 seconds), volume lookups that fail now return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of 404 Not Found, signaling clients to retry with other masters. * master: skip warmup 503 on fresh start and single-master setups - Check MaxVolumeId > 0 to distinguish restart from fresh start (MaxVolumeId is Raft-persisted, so 0 means no prior data) - Check peer count > 1 so single-master deployments aren't affected (no point suggesting "retry with other masters" if there are none) * master: address review feedback and block assigns during warmup - Protect LastLeaderChangeTime with dedicated mutex (fix data race) - Extract warmup multiplier as WarmupPulseMultiplier constant - Derive Retry-After header from pulse config instead of hardcoding - Only trigger warmup 503 for "not found" errors, not parse errors - Return nil response (not partial) on gRPC Unavailable - Add doc comments to IsWarmingUp, getter/setter, WarmupDuration - Block volume assign requests (HTTP and gRPC) during warmup, since the topology is incomplete and assignments would be unreliable - Skip warmup behavior for single-master setups (no peers to retry) * master: apply warmup to all setups, skip only on fresh start Single-master restarts still have an empty topology until heartbeats arrive, so warmup protection should apply there too. The only case to skip is a fresh cluster start (MaxVolumeId == 0), which already has no volumes to look up. 
- Remove GetMasterCount() > 1 guard from all warmup checks - Remove now-unused GetMasterCount helper - Update error messages to "topology is still loading" (not "retry with other masters" which doesn't apply to single-master) * master: add client-side retry on Unavailable for lookup and assign The server-side 503/Unavailable during warmup needs client cooperation. Previously, LookupVolumeIds and Assign would immediately propagate the error without retry. Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to 6s) when receiving Unavailable, respecting context cancellation. This covers the warmup window where the master's topology is still loading after a restart or leader change. * master: seed warmup timestamp in legacy raft path at setup The legacy raft path only set lastLeaderChangeTime inside the event listener callback, which could fire after IsLeader() was already observed as true in SetRaftServer. Seed the timestamp at setup time (matching the hashicorp path) so IsWarmingUp() is active immediately. * master: fix assign retry loop to cover full warmup window The retry loop used waitTime <= maxWaitTime as a stop condition, causing it to give up after ~13s while warmup lasts 15s. Now cap each individual sleep at maxWaitTime but keep retrying until the context is cancelled. * master: preserve gRPC status in lookup retry and fix retry window Return the raw gRPC error instead of wrapping with fmt.Errorf so status.FromError() can extract the status code. Use proper gRPC status check (codes.Unavailable) instead of string matching. Also cap individual sleep at maxWaitTime while retrying until ctx is done. * master: use gRPC status code instead of string matching in assign retry Use status.FromError/codes.Unavailable instead of brittle strings.Contains for detecting retriable gRPC errors in the assign retry loop. 
* master: use remaining warmup duration for Retry-After header Set Retry-After to the remaining warmup time instead of the full warmup duration, so clients don't wait longer than necessary. * master: reset ret.Replicas before populating from assign response Clear Replicas slice before appending to prevent duplicate entries when the assign response is retried or when alternative requests are attempted. * master: add unit tests for warmup retry behavior Test that Assign() and LookupVolumeIds() retry on codes.Unavailable and stop promptly when the context is cancelled. * master: record leader change time before initialization work Move SetLastLeaderChangeTime() to fire immediately when the leader change event is received, before DoBarrier(), EnsureTopologyId(), and updatePeers(), so the warmup clock starts at the true moment of leadership transition. * master: use topology warmup duration in volume growth wait loop Replace hardcoded constants.VolumePulsePeriod * 2 with topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait stays in sync with the configured warmup window. Remove unused constants import. * master: resolve master before creating RPC timeout context Move GetMaster() call before context.WithTimeout() so master resolution blocking doesn't consume the gRPC call timeout. * master: use NotFound flag instead of string matching for volume lookup Add a NotFound field to LookupResult and set it in findVolumeLocation when a volume is genuinely missing. Update HTTP and gRPC warmup checks to use this flag instead of strings.Contains on the error message. * master: bound assign retry loop to 30s for deadline-free contexts Without a context deadline, the Unavailable retry loop could spin forever. Add a maxRetryDuration of 30s so the loop gives up even when no context deadline is set. 
* master: strengthen assign retry cancellation test Verify the retry loop actually retried (callCount > 1) and that the returned error is context.DeadlineExceeded, not just any error. * master: extract shared retry-with-backoff utility Add util.RetryWithBackoff for context-aware, bounded retry with exponential backoff. Refactor both Assign() and LookupVolumeIds() to use it instead of duplicating the retry/sleep/backoff logic. * master: cap waitTime in RetryWithBackoff to prevent unbounded growth Cap the backoff waitTime at maxWaitTime so it doesn't grow indefinitely in long-running retry scenarios. * master: only return Unavailable during warmup when all lookups failed For batched LookupVolume requests, return partial results when some volumes are found. Only return codes.Unavailable when no volumes were successfully resolved, so clients benefit from partial results instead of retrying unnecessarily. * master: set retriable error message in 503 response body When returning 503 during warmup, replace the "not found" error in the JSON body with "service warming up, please retry" so clients don't treat it as a permanent error. * master: guard empty master address in LookupVolumeIds If GetMaster() returns empty (no master found or ctx cancelled), return an appropriate error instead of dialing an empty address. Returns ctx.Err() if context is done, otherwise codes.Unavailable to trigger retry. * master: add comprehensive tests for RetryWithBackoff Test success after retries, non-retryable error handling, context cancellation, and maxDuration cap with context.Background(). * master: enforce hard maxDuration bound in RetryWithBackoff Use a deadline instead of elapsed-time check so the last sleep is capped to remaining time. This prevents the total retry duration from overshooting maxDuration by up to one full backoff interval. 
* master: respect fresh-start bypass in RemainingWarmupDuration Check IsWarmingUp() first (which returns false when MaxVolumeId==0) so RemainingWarmupDuration returns 0 on fresh clusters. * master: round up Retry-After seconds to avoid underestimating Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round up to the next integer (2) instead of flooring down (1). * master: tighten batch lookup warmup to all-NotFound only Only return codes.Unavailable when every requested volume ID was a transient not-found. Mixed cases with non-NotFound errors now return the response with per-volume error details preserved. * master: reduce retry log noise and fix timer leak Lower per-attempt retry log from V(0) to V(1) to reduce noise during warmup. Replace time.After with time.NewTimer to avoid lingering timers when context is cancelled. * master: add per-attempt timeout for assign RPC Use a 10s per-attempt timeout so a single slow RPC can't consume the entire 30s retry budget when ctx has no deadline. * master: share single 30s retry deadline across assign request entries The Assign() function iterates over primary and fallback requests, previously giving each its own 30s RetryWithBackoff budget. With a primary + fallback, the total could reach 60s. Compute one deadline up front and pass the remaining budget to each RetryWithBackoff call so the entire Assign() call stays within a single 30s cap. * master: strengthen context-cancel test with DeadlineExceeded and retry assertions Assert errors.Is(err, context.DeadlineExceeded) to verify the error is specifically from the context deadline, and check callCount > 1 to prove retries actually occurred before cancellation. Mirrors the pattern used in TestAssignStopsOnContextCancel. * master: bound GetMaster with per-attempt timeout in LookupVolumeIds GetMaster() calls WaitUntilConnected() which can block indefinitely if no master is available. 
Previously it used the outer ctx, so a slow master resolution could consume the entire RetryWithBackoff budget in a single attempt. Move the per-attempt timeoutCtx creation before the GetMaster call so both master resolution and the gRPC LookupVolume RPC share one grpcTimeout-bounded attempt. * master: use deadline-aware context for assign retry budget The shared 30s deadline only limited RetryWithBackoff's internal wall-clock tracking, but per-attempt contexts were still derived from the original ctx and could run for up to 10s even when the budget was nearly exhausted. Create a deadlineCtx from the computed deadline and derive both RetryWithBackoff and per-attempt timeouts from it so all operations honor the shared 30s cap. * master: skip warmup gate for empty lookup requests When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds) is 0 == 0 which is true, causing empty lookup batches during warmup to return codes.Unavailable and be retried endlessly. Add a len(req.VolumeOrFileIds) > 0 guard so empty requests pass through. * master: validate request fields before warmup gate in Assign Move Replication and Ttl parsing before the IsWarmingUp() check so invalid inputs get a proper validation error instead of being masked by codes.Unavailable during warmup. Pure syntactic validation does not depend on topology state and should run first. * master: check deadline and context before starting retry attempt RetryWithBackoff only checked the deadline and context after an attempt completed or during the sleep select. If the deadline expired or context was canceled during sleep, the next iteration would still call operation() before detecting it. Add pre-operation checks so no new attempt starts after the budget is exhausted. * master: always return ctx.Err() on context cancellation in RetryWithBackoff When ctx.Err() is non-nil, the pre-operation check was returning lastErr instead of ctx.Err(). 
This broke callers checking errors.Is(err, context.DeadlineExceeded) and contradicted the documented contract. Always return ctx.Err() so the cancellation reason is properly surfaced. * master: handle warmup errors in StreamAssign without killing the stream StreamAssign was returning codes.Unavailable errors from Assign directly, which terminates the gRPC stream and breaks pooled connections. Instead, return transient errors as in-band error responses so the stream survives warmup periods. Also reset assignClient in doAssign on Send/Recv failures so a broken stream doesn't leave the proxy permanently dead. * master: wait for warmup before slot search in findAndGrow findEmptySlotsForOneVolume was called before the warmup wait loop, selecting slots from an incomplete topology. Move the warmup wait before slot search so volume placement uses the fully warmed-up topology with all servers registered. * master: add Retry-After header to /dir/assign warmup response The /dir/lookup handler already sets Retry-After during warmup but /dir/assign did not, leaving HTTP clients without guidance on when to retry. Add the same header using RemainingWarmupDuration(). * master: only seed warmup timestamp on leader at startup SetLastLeaderChangeTime was called unconditionally for both leader and follower nodes. Followers don't need warmup state, and the leader change event listener handles real elections. Move the seed into the IsLeader() block so only the startup leader gets warmup initialized. * master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign StreamAssign returns transient warmup errors as in-band AssignResponse.Error messages. doAssign was converting these to plain fmt.Errorf, losing the codes.Unavailable classification needed for the caller's retry logic. Detect warmup error messages and wrap them as status.Error(codes.Unavailable) so RetryWithBackoff can retry.	2026-03-08 16:05:45 -07:00
Chris Lu	540fc97e00	s3/iam: reuse one request id per request (#8538) * request_id: add shared request middleware * s3err: preserve request ids in responses and logs * iam: reuse request ids in XML responses * sts: reuse request ids in XML responses * request_id: drop legacy header fallback * request_id: use AWS-style request id format * iam: fix AWS-compatible XML format for ErrorResponse and field ordering - ErrorResponse uses bare <RequestId> at root level instead of <ResponseMetadata> wrapper, matching the AWS IAM error response spec - Move CommonResponse to last field in success response structs so <ResponseMetadata> serializes after result elements - Add randomness to request ID generation to avoid collisions - Add tests for XML ordering and ErrorResponse format * iam: remove duplicate error_response_test.go Test is already covered by responses_test.go. * address PR review comments - Guard against typed nil pointers in SetResponseRequestID before interface assertion (CodeRabbit) - Use regexp instead of strings.Index in test helpers for extracting request IDs (Gemini) * request_id: prevent spoofing, fix nil-error branch, thread reqID to error writers - Ensure() now always generates a server-side ID, ignoring client-sent x-amz-request-id headers to prevent request ID spoofing. Uses a private context key (contextKey{}) instead of the header string. - writeIamErrorResponse in both iamapi and embedded IAM now accepts reqID as a parameter instead of calling Ensure() internally, ensuring a single request ID per request lifecycle. - The nil-iamError branch in writeIamErrorResponse now writes a 500 Internal Server Error response instead of returning silently. - Updated tests to set request IDs via context (not headers) and added tests for spoofing prevention and context reuse. 
* sts: add request-id consistency assertions to ActionInBody tests * test: update admin test to expect server-generated request IDs The test previously sent a client x-amz-request-id header and expected it echoed back. Since Ensure() now ignores client headers to prevent spoofing, update the test to verify the server returns a non-empty server-generated request ID instead. * iam: add generic WithRequestID helper alongside reflection-based fallback Add WithRequestID[T] that uses generics to take the address of a value type, satisfying the pointer receiver on SetRequestId without reflection. The existing SetResponseRequestID is kept for the two call sites that operate on interface{} (from large action switches where the concrete type varies at runtime). Generics cannot replace reflection there since Go cannot infer type parameters from interface{}. * Remove reflection and generics from request ID setting Call SetRequestId directly on concrete response types in each switch branch before boxing into interface{}, eliminating the need for WithRequestID (generics) and SetResponseRequestID (reflection). * iam: return pointer responses in action dispatch * Fix IAM error handling consistency and ensure request IDs on all responses - UpdateUser/CreatePolicy error branches: use writeIamErrorResponse instead of s3err.WriteErrorResponse to preserve IAM formatting and request ID - ExecuteAction: accept reqID parameter and generate one if empty, ensuring every response carries a RequestId regardless of caller * Clean up inline policies on DeleteUser and UpdateUser rename DeleteUser: remove InlinePolicies[userName] from policy storage before removing the identity, so policies are not orphaned. UpdateUser: move InlinePolicies[userName] to InlinePolicies[newUserName] when renaming, so GetUserPolicy/DeleteUserPolicy work under the new name. Both operations persist the updated policies and return an error if the storage write fails, preventing partial state.	2026-03-06 15:22:39 -08:00
Chris Lu	b3f7472fd3	4.15	2026-03-04 22:13:57 -08:00
Chris Lu	7799804200	4.14 Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-04 19:22:39 -08:00
Chris Lu	f5c35240be	Add volume dir tags and EC placement priority (#8472) * Add volume dir tags to topology Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add preferred tag config for EC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Prioritize EC destinations by tags Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add EC placement planner tag tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Refactor EC placement tests to reuse buildActiveTopology Remove buildActiveTopologyWithDiskTags helper function and consolidate tag setup inline in test cases. Tests now use UpdateTopology to apply tags after topology creation, reusing the existing buildActiveTopology function rather than duplicating its logic. All tag scenario tests pass: - TestECPlacementPlannerPrefersTaggedDisks - TestECPlacementPlannerFallsBackWhenTagsInsufficient Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Consolidate normalizeTagList into shared util package Extract normalizeTagList from three locations (volume.go, detection.go, erasure_coding_handler.go) into new weed/util/tag.go as exported NormalizeTagList function. Replace all duplicate implementations with imports and calls to util.NormalizeTagList. This improves code reuse and maintainability by centralizing tag normalization logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add PreferredTags to EC config persistence Add preferred_tags field to ErasureCodingTaskConfig protobuf with field number 5. Update GetConfigSpec to include preferred_tags field in the UI configuration schema. Add PreferredTags to ToTaskPolicy to serialize config to protobuf. Add PreferredTags to FromTaskPolicy to deserialize from protobuf with defensive copy to prevent external mutation. This allows EC preferred tags to be persisted and restored across worker restarts. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add defensive copy for Tags slice in DiskLocation Copy the incoming tags slice in NewDiskLocation instead of storing by reference. This prevents external callers from mutating the DiskLocation.Tags slice after construction, improving encapsulation and preventing unexpected changes to disk metadata. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add doc comment to buildCandidateSets method Document the tiered candidate selection and fallback behavior. Explain that for a planner with preferredTags, it accumulates disks matching each tag in order into progressively larger tiers, emits a candidate set once a tier reaches shardsNeeded, and finally falls back to the full candidates set if preferred-tag tiers are insufficient. This clarifies the intended semantics for future maintainers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Apply final PR review fixes 1. Update parseVolumeTags to replicate single tag entry to all folders instead of leaving some folders with nil tags. This prevents nil pointer dereferences when processing folders without explicit tags. 2. Add defensive copy in ToTaskPolicy for PreferredTags slice to match the pattern used in FromTaskPolicy, preventing external mutation of the returned TaskPolicy. 3. Add clarifying comment in buildCandidateSets explaining that the shardsNeeded <= 0 branch is a defensive check for direct callers, since selectDestinations guarantees shardsNeeded > 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix nil pointer dereference in parseVolumeTags Ensure all folder tags are initialized to either normalized tags or empty slices, not nil. When multiple tag entries are provided and there are more folders than entries, remaining folders now get empty slices instead of nil, preventing nil pointer dereference in downstream code. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix NormalizeTagList to return empty slice instead of nil Change NormalizeTagList to always return a non-nil slice. When all tags are empty or whitespace after normalization, return an empty slice instead of nil. This prevents nil pointer dereferences in downstream code that expects a valid (possibly empty) slice. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add nil safety check for v.tags pointer Add a safety check to handle the case where v.tags might be nil, preventing a nil pointer dereference. If v.tags is nil, use an empty string instead. This is defensive programming to prevent panics in edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add volume.tags flag to weed server and weed mini commands Add the volume.tags CLI option to both the 'weed server' and 'weed mini' commands. This allows users to specify disk tags when running the combined server modes, just like they can with 'weed volume'. The flag uses the same format and description as the volume command: comma-separated tag groups per data dir with ':' separators (e.g. fast:ssd,archive). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-01 10:22:00 -08:00
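The nil-versus-empty-slice fix described for NormalizeTagList in the commit above can be illustrated with a small sketch. The normalization rules used here (trim whitespace, drop empty entries) are an assumption; the log only states that the function must return a non-nil slice even when every tag is empty or whitespace. The lowercase name marks this as a hypothetical stand-in, not the function in weed/util/tag.go.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeTagList is a hypothetical stand-in for util.NormalizeTagList.
// It always returns a non-nil slice, so callers can index, range over, or
// serialize the result without a nil check.
func normalizeTagList(tags []string) []string {
	out := make([]string, 0, len(tags)) // non-nil even when tags is nil
	for _, t := range tags {
		t = strings.TrimSpace(t) // assumed normalization step
		if t != "" {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	empty := normalizeTagList([]string{"  ", ""})
	fmt.Println(empty != nil, len(empty))
	fmt.Println(normalizeTagList([]string{" fast ", "ssd"}))
}
```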
Chris Lu	4f647e1036	Worker set its working directory (#8461) * set working directory * consolidate to worker directory * working directory * correct directory name * refactoring to use wildcard matcher * simplify * cleaning ec working directory * fix reference * clean * adjust test	2026-02-27 12:22:21 -08:00
Chris Lu	e4b70c2521	go fix	2026-02-20 18:42:00 -08:00
Konstantin Lebedev	01b3125815	[shell]: volume balance capacity by min volume density (#8026) volume balance by min volume density and active volumes	2026-02-19 13:30:59 -08:00
Chris Lu	38e14a867b	fix: cancel volume server requests on client disconnect during S3 downloads (#8373) * fix: cancel volume server requests on client disconnect during S3 downloads - Use http.NewRequestWithContext in ReadUrlAsStream so in-flight volume server requests are properly aborted when the client disconnects and the request context is canceled - Distinguish context-canceled errors (client disconnect, expected) from real server errors in streamFromVolumeServers; log at V(3) instead of ERROR to reduce noise from client-side disconnects (e.g. Nginx upstream timeout, browser cancel, curl --max-time) Fixes: "streamFromVolumeServers: streamFn failed...context canceled" * fixup: separate Canceled/DeadlineExceeded log severity in streamFromVolumeServers - context.Canceled → V(3) Infof "client disconnected" (expected, no noise) - context.DeadlineExceeded → Warningf "server-side deadline exceeded" (unexpected, needs attention) - all other errors → Errorf (unchanged)	2026-02-18 17:14:54 -08:00
Chris Lu	3c3a78d08e	4.13	2026-02-16 17:01:19 -08:00
Chris Lu	1e4f30c56f	pb: fix IPv6 double brackets in ServerAddress formatting (#8329) * pb: fix IPv6 double brackets in ServerAddress formatting * pb: refactor IPv6 tests into table-driven test * util: add JoinHostPortStr and use it in pb to avoid unsafe port parsing	2026-02-12 18:11:03 -08:00
Chris Lu	b57429ef2e	Switch empty-folder cleanup to bucket policy (#8292) * Fix Spark _temporary cleanup and add issue #8285 regression test * Generalize empty folder cleanup for Spark temp artifacts * Revert synchronous folder pruning and add cleanup diagnostics * Add actionable empty-folder cleanup diagnostics * Fix Spark temp marker cleanup in async folder cleaner * Fix Spark temp cleanup with implicit directory markers * Keep explicit directory markers non-implicit * logging * more logs * Switch empty-folder cleanup to bucket policy * Seaweed-X-Amz-Allow-Empty-Folders * less logs * go vet * less logs * refactoring	2026-02-10 18:38:38 -08:00
Chris Lu	af8273386d	4.12	2026-02-09 18:15:19 -08:00
Chris Lu	cb9e21cdc5	Normalize hashicorp raft peer ids (#8253) * Normalize raft voter ids * 4.11 * Update raft_hashicorp.go	2026-02-09 07:46:34 -08:00
Chris Lu	0c89185291	4.10	2026-02-08 21:16:58 -08:00
Chris Lu	5a5cc38692	4.09	2026-02-03 17:56:25 -08:00
Chris Lu	330bd92ddc	4.08	2026-02-02 20:44:13 -08:00
Chris Lu	ba8816e2e1	4.08	2026-02-02 20:36:03 -08:00
Chris Lu	bc853bdee5	4.07	2026-01-18 15:48:09 -08:00
Chris Lu	ce6e9be66b	4.06	2026-01-10 12:08:16 -08:00
Chris Lu	379c032868	Fix chown Input/output error on large file sets (#7996) * Fix chown Input/output error on large file sets (Fixes #7911) Implemented retry logic for MySQL/MariaDB backend to handle transient errors like deadlocks and timeouts. * Fix syntax error: missing closing brace * Refactor: Use %w for error wrapping and errors.As for extraction * Fix: Disable retry logic inside transactions	2026-01-09 18:02:59 -08:00
promalert	9012069bd7	chore: execute goimports to format the code (#7983) * chore: execute goimports to format the code Signed-off-by: promalert <promalert@outlook.com> * goimports -w . --------- Signed-off-by: promalert <promalert@outlook.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-01-07 13:06:08 -08:00
Chris Lu	0e9f433ec4	refactoring	2026-01-04 11:40:42 -08:00
Chris Lu	87b71029f7	4.05	2026-01-01 20:39:22 -08:00

924 Commits