seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-17 15:21:31 +00:00

Author	SHA1	Message	Date
Chris Lu	e24a443b17	peer chunk sharing 2/8: filer mount registry (#9131 ) * proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design. * filer: add mount-server registry behind -peer.registry.enable Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList. * weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory. * weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless. * -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through. Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount. * filer: nil-safe peerRegistryEnable + registry hardening Addresses review feedback on PR #9131. * Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. 
miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache. * Hardening (gemini review): registry now enforces three invariants: - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts) - TTL is capped at 1 hour so a runaway client cannot pin entries - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly Covered by new unit tests. * filer: rename -peer.registry.enable flag to -mount.p2p Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing. Flag renames (all three keep default=true, idle cost is near-zero): -peer.registry.enable -> -mount.p2p (weed filer) -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server) Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob. * filer: MountList returns DataCenter + List uses RLock Two review follow-ups on the mount peer registry: * weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality." * weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. 
But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).	2026-04-18 20:03:23 -07:00
Lisandro Pin	6bcacedda9	Export `master_disconnections` metrics on volume servers. (#9104) This allows tracking connection issues and master failovers in real time via Prometheus metrics. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-17 15:15:26 -07:00
Chris Lu	00a2e22478	fix(mount): remove fid pool to stop master over-allocating volumes (#9111 ) * fix(mount): remove fid pool to stop master over-allocating volumes The writeback-cache fid pool pre-allocated file IDs with ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's PickForWrite charges count * expectedDataSize against the volume's effectiveSize, so a full pool refill could charge hundreds of MB against a single volume before any bytes were actually written. That tripped RecordAssign's hard-limit path and eagerly removed volumes from writable, causing the master to grow new volumes even when the real data being written was tiny. Drop the pool entirely. Every chunk upload goes through UploadWithRetry -> AssignVolume with no ExpectedDataSize hint, letting the master fall back to the 1 MB default estimate. The mount->filer grpc connection is already cached in pb.WithGrpcClient (non-streaming mode), so per-chunk AssignVolume is a unary RPC over an existing HTTP/2 stream, not a full dial. Path-based filer.conf storage rules now apply to mount chunk assigns again, which the pool had to skip. Also remove the now-unused operation.UploadWithAssignFunc and its AssignFunc type. * fix(upload): populate ExpectedDataSize from actual chunk bytes UploadWithRetry already buffers the full chunk into `data` before calling AssignVolume, so the real size is known. Previously the assign request went out with ExpectedDataSize=0, making the master fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same over-reservation symptom the pool had, just smaller per call. Stamp ExpectedDataSize = len(data) before the assign RPC when the caller hasn't already set it. This covers mount chunk uploads, filer_copy, filersink, mq/logstore, broker_write, gateway_upload, and nfs — all the UploadWithRetry paths. 
* fix(assign): pass real ExpectedDataSize at every assign call site After removing the mount fid pool, per-chunk AssignVolume calls went out with ExpectedDataSize=0, making the master fall back to its 1 MB DefaultNeedleSizeEstimate. That's still an over-estimate for small writes. Thread the real payload size through every remaining assign site so RecordAssign charges effectiveSize accurately and stops prematurely marking volumes full. - filer: assignNewFileInfo now takes expectedDataSize and stamps it on both primary and alternate VolumeAssignRequests. Callers pass: - SSE data-to-chunk: len(data) - copy manifest save: len(data) - streamCopyChunk: srcChunk.Size - TUS sub-chunk: bytes read - saveAsChunk (autochunk/manifestize): 0 (small, size unknown until the reader is drained; master uses 1 MB default) - filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize after the adaptive chunkSize is computed. - ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter; upload_chunked.go passes the buffered dataSize at the call site. S3 PUT assignFunc stamps it on the AssignVolumeRequest. - S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize; all seven call sites pass the source chunk's Size. - operation.SubmitFiles / FilePart.Upload: derive per-fid size from FileSize (average for batched requests, real per-chunk size for sequential chunk assigns). - benchmark: pass fileSize. - filer append-to-file: pass len(data). * fix(assign): thread size through SaveDataAsChunkFunctionType The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran AssignVolume before the reader was drained, so it had to pass ExpectedDataSize=0 and fall back to the master's 1 MB default. Add an expectedDataSize parameter to SaveDataAsChunkFunctionType. - mergeIntoManifest already has the serialized manifest bytes, so it passes uint64(len(data)) directly. 
- Mount's saveDataAsChunk ignores the parameter because it uses UploadWithRetry, which already stamps len(data) on the assign after reading the payload. - webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry path and also ignore the hint. - Filer's saveAsChunk (used for manifestize) plumbs the value to assignNewFileInfo so manifest-chunk assigns get a real size. Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked) pass the chunk size they're about to upload.	2026-04-16 15:51:13 -07:00
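The over-reservation arithmetic in commit 00a2e22478 can be modeled in a few lines. The sketch below assumes a simplified charge of count × expected size with a 1 MB fallback, as the commit message describes; `assignRequest`, `stampExpectedSize`, and `chargedBytes` are hypothetical names for illustration and are not the real request type or master code.

```go
package main

import "fmt"

// 1 MB fallback estimate described in the commit message.
const defaultNeedleSizeEstimate = 1 << 20

// assignRequest is a hypothetical, trimmed-down stand-in for an assign request.
type assignRequest struct {
	Count            uint64
	ExpectedDataSize uint64
}

// stampExpectedSize fills in the size from the buffered payload
// unless the caller has already set one.
func stampExpectedSize(req *assignRequest, data []byte) {
	if req.ExpectedDataSize == 0 {
		req.ExpectedDataSize = uint64(len(data))
	}
}

// chargedBytes models how much the master reserves against a volume
// for one assign: count * expected size, with the 1 MB fallback.
func chargedBytes(req assignRequest) uint64 {
	size := req.ExpectedDataSize
	if size == 0 {
		size = defaultNeedleSizeEstimate
	}
	return req.Count * size
}

func main() {
	payload := make([]byte, 4096) // a small 4 KiB chunk

	unstamped := assignRequest{Count: 1}
	stamped := assignRequest{Count: 1}
	stampExpectedSize(&stamped, payload)

	fmt.Println("charged without hint:", chargedBytes(unstamped))
	fmt.Println("charged with real size:", chargedBytes(stamped))
}
```

Under this model a 4 KiB write without a hint reserves 256 times more than it needs, which is the small-write over-estimate the later commits in this PR remove.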
Chris Lu	979c54f693	fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089 ) * fix(wdclient,volume): compare master leader with ServerAddress.Equals Raft leader is advertised as host:httpPort.grpcPort, but clients dial host:httpPort. Raw string comparison against VolumeLocation.Leader / HeartbeatResponse.Leader therefore never matches, causing the masterclient and the volume server heartbeat loop to continuously "redirect" to the already-connected master, tearing down the stream and reconnecting. Use ServerAddress.Equals, which normalizes the grpc-port suffix. * fix(filer,mq): compare ServerAddress via Equals in two more sites filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local partition assignment check both compared a wire-supplied address string against the local self ServerAddress with raw string equality. Both are vulnerable to the same plain-vs-host:port.grpcPort mismatch as the masterclient/volume heartbeat sites: filer would bootstrap from itself, and the broker would fail to claim a partition it was actually assigned. Route both through ServerAddress.Equals. * fix(master,shell): more ServerAddress comparisons via Equals - raft_server_handlers.go HealthzHandler: s.serverAddr == leader would skip the child-lock check on the real leader when the two carry different plain/grpc-suffix forms, returning 200 OK instead of 423. - master_server.go SetRaftServer leader-change callback: the Leader() == Name() guard for ensureTopologyId could disagree with topology.IsLeader() (which already uses Equals), so leader-only initialization could be skipped after an election. - command_volume_merge.go isReplicaServer: the -target guard compared user-supplied host:port against NewServerAddressFromDataNode(...) with ==, letting an existing replica slip through when topology carries the embedded gRPC port. All routed through pb.ServerAddress.Equals. 
* fix(mq,cluster): more ServerAddress comparisons via Equals - broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the partition ownership check gated listing on raw LeaderBroker == BrokerAddress().String(), so listings silently omitted partitions hosted locally when the assignment carried the other host:port / host:port.grpcPort form. - lock_client.go: LockHostMovedTo comparison and the seedFiler fallback guard both used raw string equality against configured filer addresses (which may be plain host:port while LockHostMovedTo comes back suffixed), causing spurious host-change churn and blocking the seed-filer fallback. * fix(mq): more ServerAddress comparisons via Equals - pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct activeBrokers.Get() lookup missed brokers when a persisted assignment carried a different address encoding than the registered broker key, triggering a bogus reassignment on every read/write cycle. Added a findActiveBroker helper that falls back to an Equals-based scan and canonicalizes the assignment in place so later writes are stable. - broker_grpc_lookup.go isLockOwner: used raw string equality between LockOwner() and BrokerAddress().String(), so a lock owner could fail to recognize itself and proxy local lookup/config/admin RPCs away. - pub_client/scheduler.go onEachAssignments: reused publisher jobs only on exact LeaderBroker match, so an encoding flip in lookup results tore down and recreated a stream to the same broker.	2026-04-15 12:29:31 -07:00
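The mismatch fixed throughout commit 979c54f693 comes from two encodings of the same endpoint, host:httpPort and host:httpPort.grpcPort. The sketch below shows one way such a comparison could normalize the gRPC-port suffix before comparing; `httpForm` and `sameServer` are hypothetical helpers written for this example, and the actual pb.ServerAddress.Equals implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// httpForm strips a trailing ".grpcPort" from "host:httpPort.grpcPort".
// Only dots after the last colon are considered, so dotted hostnames
// and IPv4 addresses are left untouched.
func httpForm(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr // no port at all
	}
	if dot := strings.LastIndex(addr[colon:], "."); dot >= 0 {
		return addr[:colon+dot]
	}
	return addr
}

// sameServer compares two addresses after normalizing both forms.
func sameServer(a, b string) bool {
	return httpForm(a) == httpForm(b)
}

func main() {
	advertised := "10.0.0.1:9333.19333" // form advertised by the raft leader
	dialed := "10.0.0.1:9333"           // form the client dialed

	fmt.Println("raw equality:", advertised == dialed)
	fmt.Println("normalized:  ", sameServer(advertised, dialed))
}
```

With raw string equality the client would treat the master it is already connected to as a different leader, which is the redirect loop the commit message describes.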
Chris Lu	08d9193fe1	[nfs] Add NFS (#9067 ) * add filer inode foundation for nfs * nfs command skeleton * add filer inode index foundation for nfs * make nfs inode index hardlink aware * add nfs filehandle and inode lookup plumbing * add read-only nfs frontend foundation * add nfs namespace mutation support * add chunk-backed nfs write path * add nfs protocol integration tests * add stale handle nfs coverage * complete nfs hardlink and failover coverage * add nfs export access controls * add nfs metadata cache invalidation * fix nfs chunk read lookup routing * fix nfs review findings and rename regression * address pr 9067 review comments - filer_inode: fail fast if the snowflake sequencer cannot start, and let operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID to avoid multi-filer collisions - filer_inode: drop the redundant retry loop in nextInode - filerstore_wrapper: treat inode-index writes/removals as best-effort so a primary store success no longer surfaces as an operation failure - filer_grpc_server_rename: defer overwritten-target chunk deletion until after CommitTransaction so a rolled-back rename does not strand live metadata pointing at freshly deleted chunks - command/nfs: default ip.bind to loopback and require an explicit filer.path, so the experimental server does not expose the entire filer namespace on first run - nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire layout rather than RFC 1813 LINK3args * mount: pre-allocate inode in Mkdir and Symlink Mkdir and Symlink used to send filer_pb.CreateEntryRequest with Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns its own inode in that case, so the filer-side entry ends up with a different inode than the one the mount allocates via inodeToPath.Lookup and returns to the kernel. 
Once applyLocalMetadataEvent stores the filer's entry in the meta cache, subsequent GetAttr calls read the cached entry and hit the setAttrByPbEntry override at line 197 of weedfs_attr.go, returning the filer-assigned inode instead of the mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91) caught this — it lstat'd a freshly-created directory/symlink, renamed it, lstat'd again, and saw a different inode the second time. createRegularFile already pre-allocates via inodeToPath.AllocateInode and stamps it into the create request. Do the same thing in Mkdir and Symlink so both sides agree on the object identity from the very first request, and so GetAttr's cache path returns the same value as Mkdir / Symlink's initial response. * sequence: mask snowflake node id on int→uint32 conversion CodeQL flagged the unchecked uint32(snowflakeId) cast in NewSnowflakeSequencer as a potential truncation bug when snowflakeId is sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask to the 10 bits the snowflake library actually uses so any caller- supplied int is safely clamped into range. * add test/nfs integration suite Boots a real SeaweedFS cluster (master + volume + filer) plus the experimental `weed nfs` frontend as subprocesses and drives it through the NFSv3 wire protocol via go-nfs-client, mirroring the layout of test/sftp. The tests run without a kernel NFS mount, privileged ports, or any platform-specific tooling. Coverage includes read/write round-trip, mkdir/rmdir, nested directories, rename content preservation, overwrite + explicit truncate, 3 MiB binary file, all-byte binary and empty files, symlink round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity, sequential appends, and readdir-after-remove. Framework notes: - Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes -port.grpc explicitly so the default port+10000 convention cannot overflow uint16 on macOS. 
- Pre-creates the /nfs_export directory via the filer HTTP API before starting the NFS server — the NFS server's ensureIndexedEntry check requires the export root to exist with a real entry, which filer.Root does not satisfy when the export path is "/". - Reuses the same rpc.Client for mount and target so go-nfs-client does not try to re-dial via portmapper (which concatenates ":111" onto the address). * ci: add NFS integration test workflow Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch the NFS server, the inode filer plumbing it depends on, or the test harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu 22.04 via `make test`. * nfs: use append for buffer growth in Write and Truncate The previous make+copy pattern reallocated the full buffer on every extending write or truncate, giving O(N^2) behaviour for sequential write loops. Switching to `append(f.content, make([]byte, delta)...)` lets Go's amortized growth strategy absorb the repeated extensions. Called out by gemini-code-assist on PR 9067. * filer: honor caller cancellation in collectInodeIndexEntries Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of the inode-index scan if the client disconnects mid-walk. The cleanup is already treated as best-effort by the caller (it logs on error and continues), so a cancelled walk just means the partial index rebuild is skipped — the same failure mode as any other index write error. Flagged as a DoS concern by gemini-code-assist on PR 9067. * nfs: skip filer read on open when O_TRUNC is set openFile used to unconditionally loadWritableContent for every writable open and then discard the buffer if O_TRUNC was set. For large files that is a pointless 64 MiB round-trip. Reorder the branches so we only fetch existing content when the caller intends to keep it, and mark the file dirty right away so the subsequent Close still issues the truncating write. Called out by gemini-code-assist on PR 9067. 
* nfs: allow Seek on O_APPEND files and document buffered write cap Two related cleanups on filesystem.go: - POSIX only restricts Write on an O_APPEND fd, not lseek. The existing Seek error ("append-only file descriptors may only seek to EOF") prevented read-and-write workloads that legitimately reposition the read cursor. Write already snaps the offset to EOF before persisting (see seaweedFile Write), so Seek can unconditionally accept any offset. Update the unit test that was asserting the old behaviour. - Add a doc comment on maxBufferedWriteSize explaining that it is a per-file ceiling, the memory footprint it implies, and that the real fix for larger whole-file rewrites is streaming / multi-chunk support. Both changes flagged by gemini-code-assist on PR 9067. * nfs: guard offset before casting to int in Write CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as a potential overflow on architectures where `int` is 32-bit. The existing check only bounded the post-cast value, which is too late. Clamp f.offset against maxBufferedWriteSize before the cast and also reject negative/overflowed endOffset results. Both branches fall through to billy.ErrNotSupported, the same behaviour the caller gets today for any out-of-range buffered write. * nfs: compute Write endOffset in int64 to satisfy CodeQL The previous guard bounded f.offset but left len(p) unchecked, so CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width overflow path. Bound len(p) against maxBufferedWriteSize first, do the addition in int64, and only cast down after the total has been clamped against the buffer ceiling. Behaviour is unchanged: any out-of-range write still returns billy.ErrNotSupported. * ci: drop emojis from nfs-tests workflow summary Plain-text step summary per user preference — no decorative glyphs in the NFS CI output or checklist. 
* nfs: annotate remaining DEV_PLAN TODOs with status Three of the unchecked items are genuine follow-up PRs rather than missing work in this one, and one was actually already done: - Reuse chunk cache and mutation stream helpers without FUSE deps: checked off — the NFS server imports weed/filer.ReaderCache and weed/util/chunk_cache directly with no weed/mount or go-fuse imports. - Extract shared read/write helpers from mount/WebDAV/SFTP: annotated as deferred to a separate refactor PR (touches four packages). - Expand direct data-path writes beyond the 64 MiB buffered fallback: annotated as deferred — requires a streaming WRITE path. - Shared lock state + lock tests: annotated as blocked upstream on go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing "Current Blockers" note. * test/nfs: share port+readiness helpers with test/testutil Drop the per-suite mustPickFreePort and waitForService re-implementations in favor of testutil.MustAllocatePorts (atomic batch allocation; no close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout. Pull testutil in via a local replace directive so this standalone seaweedfs-nfs-tests module can import the in-repo package without a separate release. Subprocess startup is still master + volume + filer + nfs — no switch to weed mini yet, since mini does not know about the nfs frontend. * nfs: stream writes to volume servers instead of buffering the whole file Before this change the NFS write path held the full contents of every writable open in memory: - OpenFile(write) called loadWritableContent which read the existing file into seaweedFile.content up to maxBufferedWriteSize (64 MiB) - each Write() extended content in-place - Close() uploaded the whole buffer as a single chunk via persistContent + AssignVolume The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and even below the cap every Write paid a whole-file-in-memory cost. 
This PR rewrites the write path to match how `weed filer` and the S3 gateway persist data: - openFile(write) no longer loads the existing content at all; it only issues an UpdateEntry when O_TRUNC is set and the file is non-empty (so a fresh create+trunc is still zero-RPC) - Write() streams the caller's bytes straight to a volume server via one AssignVolume + one chunk upload, then atomically appends the resulting chunk to the filer entry through mutateEntry. Any previously inlined entry.Content is migrated to a chunk in the same update so the chunk list becomes the authoritative representation. - Truncate() becomes a direct mutateEntry (drop chunks past the new size, clip inline content, update FileSize) instead of resizing an in-memory buffer. - Close() is a no-op because everything was flushed inline. The small-file fast path that the filer HTTP handler uses is preserved: if the post-write size still fits in maxInlineWriteSize (4 MiB) and the file has no existing chunks, we rewrite entry.Content directly and skip the volume-server round-trip. This keeps single-shot tiny writes (echo, small edits) cheap while completely removing the 64 MiB cap on larger files. Read() now always reads through the chunk reader instead of a local byte slice, so reads inside the same session see the freshly appended data. Drops the unused seaweedFile.content / dirty fields, the maxBufferedWriteSize constant, and the loadWritableContent helper. Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations to match the new "no extra O_TRUNC UpdateEntry on an empty file" behavior (still 3 updates: Write + Chmod + Truncate). 
* filer: extract shared gateway upload helper for NFS and WebDAV Three filer-backed gateways (NFS, WebDAV, and mount) each had a local saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry with near-identical bodies: build AssignVolumeRequest, build UploadOption, build genFileUrlFn with optional filerProxy rewriting, call UploadWithRetry, validate the result, and call ToPbFileChunk. Pull that body into filer.SaveGatewayDataAsChunk with a GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate to one implementation. - NFS's saveDataAsChunk is now a thin adapter that assembles the GatewayChunkUploadRequest from server options and calls the helper. The chunkUploader interface keeps working for test injection because the new GatewayChunkUploader interface is structurally identical. - WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the local operation.NewUploader call plus the AssignVolume/UploadOption scaffolding. - mount is intentionally left alone. mount's saveDataAsChunk has two features that do not fit the shared helper (a pre-allocated file-id pool used to skip AssignVolume entirely, and a chunkCache write-through at offset 0 so future reads hit the mount's local cache), both of which are mount-specific. Marks the Phase 2 "extract shared read/write helpers from mount, WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals + NewChunkReaderAtFromClient) was already shared. * nfs: remove DESIGN.md and DEV_PLAN.md The planning documents have served their purpose — all phase 1 and phase 2 items are landed, phase 3 streaming writes are landed, phase 2 shared helpers are extracted, and the two remaining phase 4 items (shared lock state + lock tests) are blocked upstream on github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state RPCs. The running decision log no longer reflects current code and would just drift. 
The NFS wiki page (https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries the overview, configuration surface, architecture notes, and known limitations; the source is the source of truth for the rest.	2026-04-14 20:48:24 -07:00
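The CodeQL-driven Write guard described in commit 08d9193fe1 (bound the offset and the payload length first, add in int64, cast down only after the total is clamped) can be illustrated as below. This reflects the intermediate buffered-write design that the same PR later replaced with streaming writes. `endOffset` and `errNotSupported` are hypothetical names; the commit itself returns billy.ErrNotSupported.

```go
package main

import (
	"errors"
	"fmt"
)

// 64 MiB per-file ceiling quoted in the commit message.
const maxBufferedWriteSize = 64 << 20

// errNotSupported stands in for billy.ErrNotSupported in this sketch.
var errNotSupported = errors.New("not supported")

// endOffset returns offset+n as an int, rejecting any input that could
// overflow a 32-bit int or exceed the buffer ceiling.
func endOffset(offset int64, n int) (int, error) {
	// Bound both operands before adding so the sum cannot overflow int64.
	if offset < 0 || offset > maxBufferedWriteSize || int64(n) > maxBufferedWriteSize {
		return 0, errNotSupported
	}
	end := offset + int64(n) // addition done in int64
	if end > maxBufferedWriteSize {
		return 0, errNotSupported
	}
	return int(end), nil // safe: end <= 64 MiB fits in a 32-bit int
}

func main() {
	fmt.Println(endOffset(10, 5))
	fmt.Println(endOffset(maxBufferedWriteSize, 1))
	fmt.Println(endOffset(-1, 1))
}
```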
Lisandro Pin	67a2810d2d	Export `start_time_seconds` metrics on both master & volume servers. (#9046) These can be used to track uptime. See https://github.com/seaweedfs/seaweedfs/issues/8535 for details. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-11 09:34:08 -07:00
Chris Lu	edf7d2a074	fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039 ) * fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035) Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to readPersistedLogBufferPosition, causing LoopProcessLogData to call ReadPersistedLogBuffer on every 250ms health-check tick when a subscriber encounters ResumeFromDiskError. Each call creates an OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a readahead goroutine with a 1024-element channel, finds no data, and returns — 4 times per second even on an idle filer. This is redundant because SubscribeLocalMetadata already manages disk reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs tracking in the outer loop. Set ReadFromDiskFn back to nil for LocalMetaLogBuffer. When LoopProcessLogData encounters ResumeFromDiskError with nil ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the caller (SubscribeLocalMetadata), which blocks efficiently on listenersCond.Wait() instead of polling. * fix(filer): add gap detection for slow consumers after disk-read stall When a slow consumer falls behind and LoopProcessLogData returns ResumeFromDiskError with no flush or read-position progress, there may be a gap between persisted data and in-memory data (e.g. writes stopped while consumer was still catching up). Without this, the consumer would block on listenersCond.Wait() forever. Skip forward to the earliest in-memory time to resume progress, matching the gap-handling pattern already used in the shouldReadFromDisk path. * fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall The gap-detection block added in the previous commit skips lastReadTime forward to GetEarliestTime() and continues the outer loop. 
On the next iteration, shouldReadFromDisk becomes true (currentReadTsNs > lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the existing gap handler at the top of the loop runs its own gap check. That check uses readInMemoryLogErr == ResumeFromDiskError as the entry condition — but readInMemoryLogErr is still the stale error from two iterations ago. GetEarliestTime() now equals lastReadTime.Time (we already advanced to it), so earliestTime.After(lastReadTime.Time) is false and the handler falls into listenersCond.Wait() — stuck. Clear readInMemoryLogErr at the gap-skip point, matching the existing pattern at the earlier gap handler that already clears it for the same reason. * fix(log_buffer): GetEarliestTime must include sealed prev buffers GetEarliestTime previously returned only logBuffer.startTime (the active buffer's first timestamp). That is narrower than ReadFromBuffer's tsMemory, which is the min across active + prev buffers. Callers using GetEarliestTime for gap detection after ResumeFromDiskError (the SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time that was newer than the real earliest in-memory data. 
Impact in SubscribeLocalMetadata's slow-consumer path: - tsMemory = earliest prev buffer time (T_prev) - GetEarliestTime() = active startTime (T_active, later than T_prev) - Consumer position = T1, with T_prev < T1 < T_active - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory) - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true - Skip forward to T_active -- silently drops the prev-buffer data - And when T_active happens to equal the stuck position, gap detect evaluates false, and the subscriber stalls on listenersCond.Wait() This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing failure in CI where the consumer stalled at 10220/20000 after writing stopped -- the buffer still had data in prev[0..3], but gap detection was comparing against the active buffer's startTime. Fix: scan all sealed prev buffers under RLock, return the true minimum startTime. Matches the min-of-buffers logic in ReadFromBuffer. * test(log_buffer): make DiskReadRetry test deterministic The previous test added the message via AddToBuffer + ForceFlush and relied on a race: the second disk read had to happen before the data was delivered through the in-memory path. Under the race detector or on a slow CI runner, the reader is woken by AddToBuffer's notification, finds the data in the active buffer or its prev slot, and returns after exactly one disk read — failing the >= 2 disk reads assertion even though the loop behaved correctly. Reproduced on master with race detector (2/5 failures). Rewrite the test to deliver the data exclusively through the disk-read path: no AddToBuffer, no ForceFlush. The test waits until the reader has issued at least one no-op disk read, then atomically flips a "dataReady" flag. The reader's next iteration through readFromDiskFn returns the entry. This deterministically exercises the retry-loop behavior the test was originally written to protect, and removes the in-memory delivery race entirely.	2026-04-11 23:12:54 -07:00
os-pradipbabar	9cae95d749	fix(filer): prevent data corruption during graceful shutdown (#9037 ) * fix: wait for in-flight uploads to complete before filer shutdown Prevents data corruption when SIGTERM is received during active uploads. The filer now waits for all in-flight operations to complete before calling the underlying shutdown logic. This affects all deployment types (Kubernetes, Docker, systemd) and fixes corruption issues during rolling updates, certificate rotation, and manual restarts. Changes: - Add FilerServer.Shutdown() method with upload wait logic - Update grace.OnInterrupt hook to use new shutdown method Fixes data corruption reported by production users during pod restarts. * fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete * fix: address review comments on graceful shutdown - Add 10s timeout to gRPC GracefulStop to prevent indefinite blocking from long-lived streams (falls back to Stop on timeout) - Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within Kubernetes default 30s termination grace period - Move fs.Shutdown() (database close) after Serve() returns instead of a separate hook to eliminate race where main goroutine exits before the shutdown hook runs * fix: shut down all HTTP servers before filer database close Address remaining review comments: - Shut down auxiliary HTTP servers (Unix socket, local listener) during graceful shutdown so they can't serve write traffic after the main server stops - Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it completes before os.Exit(0), fixing the race between the grace goroutine and the main goroutine - Use sync.Once to ensure fs.Shutdown() runs exactly once regardless of whether shutdown is signal-driven or context-driven (MiniCluster) --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-11 21:18:22 -07:00
Chris Lu	b37bbf541a	feat(master): drain pending size before marking volume readonly (#9036 ) * feat(master): drain pending size before marking volume readonly When vacuum, volume move, or EC encoding marks a volume readonly, in-flight assigned bytes may still be pending. This adds a drain step: immediately remove from writable list (stop new assigns), then wait for pending to decay below 4MB or 30s timeout. - Add volumeSizeTracking struct consolidating effectiveSize, reportedSize, and compactRevision into a single map - Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable, DrainAndSetVolumeReadOnly to VolumeLayout - UpdateVolumeSize detects compaction via compactRevision change and resets effectiveSize instead of decaying - Wire drain into vacuum (topology_vacuum.go) and volume mark readonly (master_grpc_server_volume.go) * fix: use 2MB pending size drain threshold * fix: check crowded state on initial UpdateVolumeSize registration * fix: respect context cancellation in drain, relax test timing - DrainAndSetVolumeReadOnly now accepts context.Context and returns early on cancellation (for gRPC handler timeout/cancel) - waitForPendingDrain uses select on ctx.Done instead of time.Sleep - Increase concurrent heartbeat test timeout from 10s to 15s for CI * fix: use time-based dedup so decay runs even when reported size is unchanged The value-based dedup (same reportedSize + compactRevision = skip) prevented decay from running when pending bytes existed but no writes had landed on disk yet. The reported size stayed the same across heartbeats, so the excess never decayed. Fix: dedup replicas within the same heartbeat cycle using a 2-second time window instead of comparing values. This allows decay to run once per heartbeat cycle even when the reported size is unchanged. 
Also confirmed finding 1 (draining re-add race) is a false positive: - Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes - Move/EC: readonlyVolumes flag prevents re-adding during drain * fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up to 30s waiting for pending bytes to decay. In integration tests (and real clusters during EC encoding), this caused timeouts because multiple volumes are marked readonly sequentially and heartbeats may not arrive fast enough to decay pending within the drain window. Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops new assigns) and only logs a warning if pending bytes remain. The drain wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs inside the master's own goroutine pool. Remove DrainAndSetVolumeReadOnly as it's no longer used. * fix: relax test timing, rename test, add post-condition assert * test: add vacuum integration tests with CI workflow Full-cluster integration test for vacuum, modeled on the EC integration tests. Starts a real master + 2 volume servers, uploads data, deletes entries to create garbage, runs volume.vacuum via shell command, and verifies garbage cleanup and data integrity. Test flow: 1. Start cluster (master + 2 volume servers) 2. Upload 10 files to create volume with data 3. Delete 5 files to create ~50% garbage 4. Verify garbage ratio > 10% 5. Run volume.vacuum command 6. Verify garbage cleaned up 7. Verify remaining 5 files are still accessible CI workflow runs on push/PR to master with 15-minute timeout. Log collection on failure via artifact upload. 
* fix: use 500KB files and delete 75% to exceed vacuum garbage threshold * fix: add shell lock before vacuum command, fix compilation error * fix: strengthen vacuum integration test assertions - waitForServer: use net.DialTimeout instead of grpc.NewClient for real TCP readiness check - verify_garbage_before_vacuum: t.Fatal instead of warning when no garbage detected - verify_cleanup_after_vacuum: t.Fatal if no server reported the volume or cleanup wasn't verified - verify_remaining_data: read actual file contents via HTTP and compare byte-for-byte against original uploaded payloads * fix: use http.Client with timeout and close body before retry	2026-04-11 18:29:11 -07:00
Chris Lu	10b0bdce02	feat: pass expected_data_size from clients for size-aware assignment (#9032) * feat: pass expected_data_size from clients for size-aware assignment Add expected_data_size field to AssignRequest (master proto) and AssignVolumeRequest (filer proto) so clients can hint how large the data will be. The master uses this instead of the 1MB default when tracking pending volume sizes for weighted assignment. - Add expected_data_size to master.proto AssignRequest - Add expected_data_size to filer.proto AssignVolumeRequest - Wire through filer AssignVolume handler - Wire through HTTP submit handler (uses actual upload size) - Add ExpectedDataSize to VolumeAssignRequest in operation package - Topology.PickForWrite accepts optional expectedDataSize parameter * fix: guard integer conversions in expected_data_size path - common.go: clamp OriginalDataSize to non-negative before uint64 cast - topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast * fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize - HTTP /dir/assign now parses optional "dataSize" query parameter and passes it to PickForWrite instead of hardcoded 0 - Add test assertion for PickForWrite with non-zero expectedDataSize	2026-04-11 11:30:47 -07:00
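The integer-conversion guards in the commit above follow a common Go pattern, sketched here with hypothetical helper names (`toUint64NonNegative`, `toInt64Capped`). The actual code in common.go and topology.go may be shaped differently.

```go
package main

import (
	"fmt"
	"math"
)

// toUint64NonNegative clamps negative sizes to zero before converting,
// because converting a negative int64 to uint64 wraps to a huge value.
func toUint64NonNegative(n int64) uint64 {
	if n < 0 {
		return 0
	}
	return uint64(n)
}

// toInt64Capped caps the value at math.MaxInt64 before converting,
// because larger uint64 values would wrap to a negative int64.
func toInt64Capped(n uint64) int64 {
	if n > math.MaxInt64 {
		return math.MaxInt64
	}
	return int64(n)
}

func main() {
	fmt.Println(toUint64NonNegative(-5), toUint64NonNegative(4096))
	fmt.Println(toInt64Capped(1<<63), toInt64Capped(4096))
}
```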
Chris Lu	e648c76bcf	go fmt	2026-04-10 17:31:14 -07:00
Chris Lu	6f036c7015	fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock (#8998) * fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock When fastResume is active (single-master + resumeState + non-empty log), the raft server becomes leader within ~1ms. DoJoinCommand then enters the leaderLoop's processCommand path, which calls setCommitIndex to commit all pending entries. The goraft setCommitIndex implementation returns early when it encounters a JoinCommand entry (to recalculate quorum), which can prevent the new entry's event channel from being notified — leaving DoJoinCommand blocked forever. Each restart appends a new raft:join entry to the log, while the conf file's commitIndex (only persisted on AddPeer) lags behind. After 3-4 restarts the uncommitted range contains old JoinCommand entries that trigger the early return before the new entry is reached. Fix: skip DoJoinCommand when the raft log already has entries (the server was already joined in a previous run). The fastResume mechanism handles leader election independently. * fix(master): handle Hashicorp Raft in HasExistingState Add Hashicorp Raft support to HasExistingState by checking AppliedIndex, consistent with how other RaftServer methods handle both raft implementations. * fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft AppliedIndex() reflects in-memory FSM state which starts at 0 before log replay completes. LastIndex() reads from persisted stable storage, correctly mirroring the non-Hashicorp IsLogEmpty() check.	2026-04-08 21:08:50 -07:00
Lars Lehtonen	8edadf7f4a	chore(weed/server): prune unused unexported struct fields (#8980)	2026-04-07 21:24:30 -07:00
Chris Lu	940eed0bd3	fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972 ) * fix(ec): generate .ecx before EC shards to prevent data inconsistency In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER the EC shards were generated from .dat. If any write occurred between these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses the read-only check), the .ecx would contain entries pointing to data that doesn't exist in the EC shards, causing "shard too short" and "size mismatch" errors on subsequent reads and scrubs. Fix by generating .ecx FIRST, then snapshotting datFileSize, then encoding EC shards. If a write sneaks in after .ecx generation, the EC shards contain more data than .ecx references — which is harmless (the extra data is simply not indexed). Also snapshot datFileSize before EC encoding to ensure the .vif reflects the same .dat state that .ecx was generated from. Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the race condition by appending data between EC encoding and .ecx generation. * fix: pass actual offset to ReadBytes, improve test quality - Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve correct error metrics and error messages within ReadBytes - Handle Stat() error in assembleFromIntervalsAllowError - Rename TestEcConsistency_DatFileGrowsDuringEncoding to TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size encoding, not concurrent growth) - Update test comment to clarify it reproduces the old buggy sequence - Fix verification loop to advance by readSize for full data coverage * fix(ec): add dat/idx consistency check in worker EC encoding The erasure_coding worker copies .dat and .idx as separate network transfers. If a write lands on the source between these copies, the .idx may have entries pointing past the end of .dat, leading to EC volumes with .ecx entries that reference non-existent shard data. 
Add verifyDatIdxConsistency() that walks the .idx and verifies no entry's offset+size exceeds the .dat file size. This fails the EC task early with a clear error instead of silently producing corrupt EC volumes. * test(ec): add integration test verifying .ecx/.ecd consistency TestEcIndexConsistencyAfterEncode uploads multiple needles of varying sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then reads every needle back via the EC read path and verifies payload correctness. This catches any inconsistency between .ecx index entries and EC shard data. * fix(test): account for needle overhead in test volume fixture WriteTestVolumeFiles created a .dat of exactly datSize bytes but the .idx entry claimed a needle of that same size. GetActualSize adds header + checksum + timestamp overhead, so the consistency check correctly rejects this as the needle extends past the .dat file. Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry is consistent with the .dat contents. * fix(test): remove flaky shard ID assertion in EC scrub test When shard 0 is truncated on disk after mount, the volume server may detect corruption via parity mismatches (shards 10-13) rather than a direct read failure on shard 0, depending on OS caching/mmap behavior. Replace the brittle shard-0-specific check with a volume ID validation. * fix(test): close upload response bodies and tighten file count assertion Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd leaks during test execution. Also tighten TotalFiles check from >= 1 to == 1 since ecSetup uploads exactly one file.	2026-04-07 19:05:36 -07:00
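The idea behind the dat/idx consistency check in the commit above can be sketched as a walk over index entries that rejects any entry ending past the .dat file. This is a simplified illustration: `idxEntry` and `verifyEntries` are hypothetical, offsets are plain byte offsets, sizes are assumed to already include header and padding overhead, and treating non-positive sizes as deleted entries is an assumption.

```go
package main

import "fmt"

// idxEntry is a hypothetical, simplified view of one .idx record.
type idxEntry struct {
	key    uint64
	offset int64 // byte offset of the needle in the .dat file
	size   int64 // on-disk size including overhead; <= 0 is treated as deleted
}

// verifyEntries returns an error for the first entry whose data would
// extend past the end of the .dat file.
func verifyEntries(entries []idxEntry, datSize int64) error {
	for _, e := range entries {
		if e.size <= 0 {
			continue // assumption: deleted entries reference no live data
		}
		if end := e.offset + e.size; end > datSize {
			return fmt.Errorf("needle %d ends at byte %d, past .dat size %d", e.key, end, datSize)
		}
	}
	return nil
}

func main() {
	entries := []idxEntry{
		{key: 1, offset: 8, size: 100},
		{key: 2, offset: 108, size: 200},
	}
	fmt.Println(verifyEntries(entries, 308)) // both entries fit
	fmt.Println(verifyEntries(entries, 300)) // .dat copied before the second write landed
}
```

Failing early in this way surfaces a torn copy of .dat and .idx as a clear task error rather than as unreadable needles after encoding.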
Chris Lu	a4753b6a3b	S3: delay empty folder cleanup to prevent Spark write failures (#8970) * S3: delay empty folder cleanup to prevent Spark write failures (#8963) Empty folders were being cleaned up within seconds, causing Apache Spark (s3a) writes to fail when temporary directories like _temporary/0/task_xxx/ were briefly empty. - Increase default cleanup delay from 5s to 2 minutes - Only process queue items that have individually aged past the delay (previously the entire queue was drained once any item triggered) - Make the delay configurable via filer.toml: [filer.options] s3.empty_folder_cleanup_delay = "2m" * test: increase cleanup wait timeout to match 2m delay The empty folder cleanup delay was increased to 2 minutes, so the Spark integration test needs to wait longer for temporary directories to disappear. * fix: eagerly clean parent directories after empty folder deletion After deleting an empty folder, immediately try to clean its parent rather than relying on cascading metadata events that each re-enter the 2-minute delay queue. This prevents multi-minute waits when cleaning nested temporary directory trees (e.g. Spark's _temporary hierarchy with 3+ levels would take 6m+ vs near-instant). Fixes the CI failure where lingering _temporary parent directories were not cleaned within the test's 3-minute timeout.	2026-04-07 13:20:59 -07:00
Chris Lu	4efe0acaf5	fix(master): fast resume state and default resumeState to true (#8925) * fix(master): fast resume state and default resumeState to true When resumeState is enabled in single-master mode, the raft server had existing log entries so the self-join path couldn't promote to leader. The server waited the full election timeout (10-20s) before self-electing. Fix by temporarily setting election timeout to 1ms before Start() when in single-master + resumeState mode with existing log, then restoring the original timeout after leader election. This makes resume near-instant. Also change the default for resumeState from false to true across all CLI commands (master, mini, server) so state is preserved by default. * fix(master): prevent fastResume goroutine from hanging forever Use defer to guarantee election timeout is always restored, and bound the polling loop with a timeout so it cannot spin indefinitely if leader election never succeeds. * fix(master): use ticker instead of time.After in fastResume polling loop	2026-04-04 14:15:56 -07:00
Chris Lu	896114d330	fix(admin): fix master leader link showing incorrect port in Admin UI (#8924) fix(admin): use gRPC address for current server in RaftListClusterServers The old Raft implementation was returning the HTTP address (ms.option.Master) for the current server, while peers used gRPC addresses (peer.ConnectionString). The Admin UI's GetClusterMasters() converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress (port - 10000), which produced a negative port (-667) for the current server since its address was already in HTTP format (port 9333). Use ToGrpcAddress() for consistency with both HashicorpRaft (which stores gRPC addresses) and old Raft peers. Fixes #8921	2026-04-04 11:50:43 -07:00
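The port arithmetic behind the bug in the commit above is easy to reproduce. The helper names below are hypothetical; only the fixed offset of 10000 between HTTP and gRPC ports and the values 9333 and -667 come from the commit message.

```go
package main

import "fmt"

// grpcPortOffset is the fixed distance between a server's HTTP port and
// its gRPC port, per the "port - 10000" conversion in the commit message.
const grpcPortOffset = 10000

// grpcToHTTPPort converts a gRPC port to the matching HTTP port.
func grpcToHTTPPort(grpcPort int) int { return grpcPort - grpcPortOffset }

// httpToGRPCPort converts an HTTP port to the matching gRPC port.
func httpToGRPCPort(httpPort int) int { return httpPort + grpcPortOffset }

func main() {
	// Bug: an address already in HTTP form is converted as if it were gRPC.
	fmt.Println(grpcToHTTPPort(9333))
	// Fix: report the gRPC address first, so the conversion round-trips.
	fmt.Println(grpcToHTTPPort(httpToGRPCPort(9333)))
}
```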
Chris Lu	d1823d3784	fix(s3): include static identities in listing operations (#8903 ) * fix(s3): include static identities in listing operations Static identities loaded from -s3.config file were only stored in the S3 API server's in-memory state. Listing operations (s3.configure shell command, aws iam list-users) queried the credential manager which only returned dynamic identities from the backend store. Register static identities with the credential manager after loading so they are included in LoadConfiguration and ListUsers results, and filtered out before SaveConfiguration to avoid persisting them to the dynamic store. Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896 * fix: avoid mutating caller's config and defensive copies - SaveConfiguration: use shallow struct copy instead of mutating the caller's config.Identities field - SetStaticIdentities: skip nil entries to avoid panics - GetStaticIdentities: defensively copy PolicyNames slice to avoid aliasing the original * fix: filter nil static identities and sync on config reload - SetStaticIdentities: filter nil entries from the stored slice (not just from staticNames) to prevent panics in LoadConfiguration/ListUsers - Extract updateCredentialManagerStaticIdentities helper and call it from both startup and the grace.OnReload handler so the credential manager's static snapshot stays current after config file reloads * fix: add mutex for static identity fields and fix ListUsers for store callers - Add sync.RWMutex to protect staticIdentities/staticNames against concurrent reads during config reload - Revert CredentialManager.ListUsers to return only store users, since internal callers (e.g. 
DeletePolicy) look up each user in the store and fail on non-existent static entries - Merge static usernames in the filer gRPC ListUsers handler instead, via the new GetStaticUsernames method - Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was failing because DeletePolicy iterated static users that don't exist in the store * fix: show static identities in admin UI and weed shell The admin UI and weed shell s3.configure command query the filer's credential manager via gRPC, which is a separate instance from the S3 server's credential manager. Static identities were only registered on the S3 server's credential manager, so they never appeared in the filer's responses. - Add CredentialManager.LoadS3ConfigFile to parse a static S3 config file and register its identities - Add FilerOptions.s3ConfigFile so the filer can load the same static config that the S3 server uses - Wire s3ConfigFile through in weed mini and weed server modes - Merge static usernames in filer gRPC ListUsers handler - Add CredentialManager.GetStaticUsernames helper - Add sync.RWMutex to protect concurrent access to static identity fields - Avoid importing weed/filer from weed/credential (which pulled in filer store init() registrations and broke test isolation) - Add docker/compose/s3_static_users_example.json * fix(admin): make static users read-only in admin UI Static users loaded from the -s3.config file should not be editable or deletable through the admin UI since they are managed via the config file. - Add IsStatic field to ObjectStoreUser, set from credential manager - Hide edit, delete, and access key buttons for static users in the users table template - Show a "static" badge next to static user names - Return 403 Forbidden from UpdateUser and DeleteUser API handlers when the target user is a static identity * fix(admin): show details for static users GetObjectStoreUserDetails called credentialManager.GetUser which only queries the dynamic store. 
For static users this returned ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup fails. * fix(admin): load static S3 identities in admin server The admin server has its own credential manager (gRPC store) which is a separate instance from the S3 server's and filer's. It had no static identity data, so IsStaticIdentity returned false (edit/delete buttons shown) and GetStaticIdentity returned nil (details page failed). Pass the -s3.config file path through to the admin server and call LoadS3ConfigFile on its credential manager, matching the approach used for the filer. * fix: use protobuf is_static field instead of passing config file path The previous approach passed -s3.config file path to every component (filer, admin). This is wrong because the admin server should not need to know about S3 config files. Instead, add an is_static field to the Identity protobuf message. The field is set when static identities are serialized (in GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads configuration via GetConfiguration automatically sees which identities are static, without needing the config file. - Add is_static field (tag 8) to iam_pb.Identity proto message - Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile - Admin GetObjectStoreUsers reads identity.IsStatic from proto - Admin IsStaticUser helper loads config via gRPC to check the flag - Filer GetUser gRPC handler falls back to GetStaticIdentity - Remove s3ConfigFile from AdminOptions and NewAdminServer signature	2026-04-03 20:01:28 -07:00
Chris Lu	0798b274dd	feat(s3): add concurrent chunk prefetch for large file downloads (#8917 ) * feat(s3): add concurrent chunk prefetch for large file downloads Add a pipe-based prefetch pipeline that overlaps chunk fetching with response writing during S3 GetObject, SSE downloads, and filer proxy. While chunk N streams to the HTTP response, fetch goroutines for the next K chunks establish HTTP connections to volume servers ahead of time, eliminating the RTT gap between sequential chunk fetches. Uses io.Pipe for minimal memory overhead (~1MB per download regardless of chunk size, vs buffering entire chunks). Also increases the streaming read buffer from 64KB to 256KB to reduce syscall overhead. Benchmark results (64KB chunks, prefetch=4): - 0ms latency: 1058 → 2362 MB/s (2.2× faster) - 5ms latency: 11.0 → 41.7 MB/s (3.8× faster) - 10ms latency: 5.9 → 23.3 MB/s (4.0× faster) - 20ms latency: 3.1 → 12.1 MB/s (3.9× faster) * fix: address review feedback for prefetch pipeline - Fix data race: use chunkPipeResult (pointer) on channel to avoid copying struct while fetch goroutines write to it. Confirmed clean with -race detector. - Remove concurrent map write: retryWithCacheInvalidation no longer updates fileId2Url map. Producer only reads it; consumer never writes. - Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure. - Add local cancellable context so consumer errors (client disconnect) immediately stop the producer and all in-flight fetch goroutines. fix(test): remove dead code and add Range header support in test server - Remove unused allData variable in makeChunksAndServer - Add Range header handling to createTestServer for partial chunk read coverage (206 Partial Content, 416 Range Not Satisfiable) * fix: correct retry condition and goroutine leak in prefetch pipeline - Fix retry condition: use result.fetchErr/result.written instead of copied to decide cache-invalidation retry. 
The old condition wrongly triggered retry when the fetch succeeded but the response writer failed on the first write (copied==0 despite fetcher having data). Now matches the sequential path (stream.go:197) which checks whether the fetcher itself wrote zero bytes. - Fix goroutine leak: when the producer's send to the results channel is interrupted by context cancellation, the fetch goroutine was already launched but the result was never sent to the channel. The drain loop couldn't handle it. Now waits on result.done before returning so every fetch goroutine is properly awaited.	2026-04-03 19:57:30 -07:00
Chris Lu	995dfc4d5d	chore: remove ~50k lines of unreachable dead code (#8913 ) * chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. 
The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking.	2026-04-03 16:04:27 -07:00
Chris Lu	af68449a26	Process .ecj deletions during EC decode and vacuum decoded volume (#8863 ) * Process .ecj deletions during EC decode and vacuum decoded volume (#8798) When decoding EC volumes back to normal volumes, deletions recorded in the .ecj journal were not being applied before computing the dat file size or checking for live needles. This caused the decoded volume to include data for deleted files and could produce false positives in the all-deleted check. - Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first - Vacuum the decoded volume after mounting in ec.decode to compact out deleted needle data from the .dat file - Add integration tests for decoding with non-empty .ecj files * storage: add offline volume compaction helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: compact decoded volumes before deleting shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: address PR review comments - Fall back to data directory for .ecx when idx directory lacks it - Make compaction failure non-fatal during EC decode - Remove misleading "buffer: 10%" from space check error message * ec: collect .ecj from all shard locations during decode Each server's .ecj only contains deletions for needles whose data resides in shards held by that server. Previously, sources with no new data shards to contribute were skipped entirely, losing their .ecj deletion entries. Now .ecj is always appended from every shard location so RebuildEcxFile sees the full set of deletions. * ec: add integration tests for .ecj collection during decode TestEcDecodePreservesDeletedNeedles: verifies that needles deleted via VolumeEcBlobDelete are excluded from the decoded volume. TestEcDecodeCollectsEcjFromPeer: regression test for the fix in collectEcShards. 
Deletes a needle only on a peer server that holds no new data shards, then verifies the deletion survives decode via .ecj collection. * ec: address review nits in decode and tests - Remove double error wrapping in mountDecodedVolume - Check VolumeUnmount error in peer ecj test - Assert 404 specifically for deleted needles, fail on 5xx --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-01 01:15:26 -07:00
Chris Lu	75a6a34528	dlm: resilient distributed locks via consistent hashing + backup replication (#8860 ) * dlm: replace modulo hashing with consistent hash ring Introduce HashRing with virtual nodes (CRC32-based consistent hashing) to replace the modulo-based hashKeyToServer. When a filer node is removed, only keys that hashed to that node are remapped to the next server on the ring, leaving all other mappings stable. This is the foundation for backup replication — the successor on the ring is always the natural takeover node. * dlm: add Generation and IsBackup fields to Lock Lock now carries IsBackup (whether this node holds the lock as a backup replica) and Generation (a monotonic fencing token that increments on each fresh acquisition, stays the same on renewal). Add helper methods: AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock. * dlm: add ReplicateLock RPC and generation/is_backup proto fields Add generation field to LockResponse for fencing tokens. Add generation and is_backup fields to Lock message. Add ReplicateLock RPC for primary-to-backup lock replication. Add ReplicateLockRequest/ReplicateLockResponse messages. * dlm: add async backup replication to DistributedLockManager Route lock/unlock via consistent hash ring's GetPrimaryAndBackup(). After a successful lock or unlock on the primary, asynchronously replicate the operation to the backup server via ReplicateFunc callback. Single-server deployments skip replication. * dlm: add ReplicateLock handler and backup-aware topology changes Add ReplicateLock gRPC handler for primary-to-backup replication. Revise OnDlmChangeSnapshot to handle three cases on topology change: - Promote backup locks when this node becomes primary - Demote primary locks when this node becomes backup - Transfer locks when this node is neither primary nor backup Wire up SetupDlmReplication during filer server initialization. 
* dlm: expose generation fencing token in lock client LiveLock now captures the generation from LockResponse and exposes it via Generation() method. Consumers can use this as a fencing token to detect stale lock holders. * dlm: update empty folder cleaner to use consistent hash ring Replace local modulo-based hashKeyToServer with LockRing.GetPrimary() which uses the shared consistent hash ring for folder ownership. * dlm: add unit tests for consistent hash ring Test basic operations, consistency on server removal (only keys from removed server move), backup-is-successor property (backup becomes new primary when primary is removed), and key distribution balance. * dlm: add integration tests for lock replication failure scenarios Test cases: - Primary crash with backup promotion (backup has valid token) - Backup crash with primary continuing - Both primary and backup crash (lock lost, re-acquirable) - Rolling restart across all nodes - Generation fencing token increments on new acquisition - Replication failure (primary still works independently) - Unlock replicates deletion to backup - Lock survives server addition (topology change) - Consistent hashing minimal disruption (only removed server's keys move) * dlm: address PR review findings 1. Causal replication ordering: Add per-lock sequence number (Seq) that increments on every mutation. Backup rejects incoming mutations with seq <= current seq, preventing stale async replications from overwriting newer state. Unlock replication also carries seq and is rejected if stale. 2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to the new primary first and only demotes to backup after a successful TransferLocks RPC. If the transfer fails, the lock stays as primary on this node. 3. SetSnapshot candidateServers leak: Replace the candidateServers map entirely instead of appending, so removed servers don't linger. 4. 
TransferLocks preserves Generation and Seq: InsertLock now accepts generation and seq parameters. After accepting a transferred lock, the receiving node re-replicates to its backup. 5. Rolling restart test: Add re-replication step after promotion and assert survivedCount > 0. Add TestDLM_StaleReplicationRejected. 6. Mixed-version upgrade note: Add comment on HashRing documenting that all filer nodes must be upgraded together. * dlm: serve renewals locally during transfer window on node join When a new node joins and steals hash ranges from surviving nodes, there's a window between ring update and lock transfer where the client gets redirected to a node that doesn't have the lock yet. Fix: if the ring says primary != self but we still hold the lock locally (non-backup, matching token), serve the renewal/unlock here rather than redirecting. The lock will be transferred by OnDlmChangeSnapshot, and subsequent requests will go to the new primary once the transfer completes. Add tests: - TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption when a node drops and a new one joins (14/100 surviving-node locks disrupted, all handled by transfer logic) - TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on old primary during the transfer window * dlm: master-managed lock ring with stabilization batching The master now owns the lock ring membership. Instead of filers independently reacting to individual ClusterNodeUpdate add/remove events, the master: 1. Tracks filer membership in LockRingManager 2. Batches rapid changes with a 1-second stabilization timer (e.g., a node drop + join within 1 second → single ring update) 3. Broadcasts the complete ring snapshot atomically via the new LockRingUpdate message in KeepConnectedResponse Filers receive the ring as a complete snapshot and apply it via SetSnapshot, ensuring all filers converge to the same ring state without intermediate churn. 
This eliminates the double-churn problem where a rapid drop+join would fire two separate ring mutations, each triggering lock transfers and disrupting ownership on surviving nodes. * dlm: track ring version, reject stale updates, remove dead code SetSnapshot now takes a version parameter from the master. Stale updates (version < current) are rejected, preventing reordered messages from overwriting a newer ring state. Version 0 is always accepted for bootstrap. Remove AddServer/RemoveServer from LockRing — the ring is now exclusively managed by the master via SetSnapshot. Remove the candidateServers map that was only used by those methods. * dlm: fix SelectLocks data race, advance generation on backup insert - SelectLocks: change RLock to Lock since the function deletes map entries, which is a write operation and causes a data race under RLock. - InsertBackupLock: advance nextGeneration to at least the incoming generation so that after failover promotion, new lock acquisitions get a generation strictly greater than any replicated lock. - Bump replication failure log from V(1) to Warningf for production visibility. * dlm: fix SetSnapshot race, test reliability, timer edge cases - SetSnapshot: hold LockRing lock through both version update and Ring.SetServers() so they're atomic. Prevents a concurrent caller from seeing the new version but applying stale servers. - Transfer window test: search for a key that actually moves primary when filer4 joins, instead of relying on a fixed key that may not. - renewLock redirect: pass the existing token to the new primary instead of empty string, so redirected renewals work correctly. - scheduleBroadcast: check timer.Stop() return value. If the timer already fired, the callback picks up latest state. - FlushPending: only broadcast if timer.Stop() returns true (timer was still pending). If false, the callback is already running. - Fix test comment: "idempotent" → "accepted, state-changing". 
* dlm: use wall-clock nanoseconds for lock ring version The lock ring version was an in-memory counter that reset to 0 on master restart. A filer that had seen version 5 would reject version 1 from the restarted master. Fix: use time.Now().UnixNano() as the version. This survives master restarts without persistence — the restarted master produces a version greater than any pre-restart value. * dlm: treat expired lock owners as missing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: reject stale lock transfers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: order replication by generation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: bootstrap lock ring on reconnect Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-30 23:29:56 -07:00
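The ring-version rule described in the commit above (stale snapshots are rejected, version 0 is always accepted for bootstrap, and version plus server list change under a single lock) can be sketched roughly as follows. The `ring` type and the `setSnapshot` signature are hypothetical stand-ins for illustration, not the actual `LockRing.SetSnapshot` code; how a version-0 snapshot affects the stored version is also an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ring is a hypothetical stand-in for the filer-side lock ring.
type ring struct {
	mu      sync.Mutex
	version int64
	servers []string
}

// setSnapshot applies a complete ring snapshot and reports whether it was accepted.
// A snapshot older than the current one is rejected; version 0 is always accepted.
// Version and servers are updated under one lock so no caller can observe a new
// version paired with an old server list.
func (r *ring) setSnapshot(servers []string, version int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if version != 0 && version < r.version {
		return false // reordered or stale update
	}
	if version > r.version { // assumption: a bootstrap (0) update keeps the known version
		r.version = version
	}
	r.servers = append([]string(nil), servers...)
	return true
}

func main() {
	r := &ring{}
	// Wall-clock nanoseconds keep growing across master restarts.
	v1 := time.Now().UnixNano()
	fmt.Println(r.setSnapshot([]string{"filer1", "filer2"}, v1))   // accepted
	fmt.Println(r.setSnapshot([]string{"filer1"}, v1-1))           // rejected as stale
	fmt.Println(r.setSnapshot([]string{"filer1", "filer3"}, v1+1)) // accepted
}
```

Using a timestamp as the version means a restarted master does not need persisted state to produce a value larger than anything filers have already seen, provided its clock does not move backwards.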
Chris Lu	4705d8b82b	Fix stale admin lock metric when lock expires and is reacquired (#8859 ) * Fix stale admin lock metric when lock expires and is reacquired (#8857) When a lock expired without an explicit unlock and a different client acquired it, the old client's metric was never cleared, causing multiple clients to appear as simultaneously holding the lock. * Use DeleteLabelValues instead of Set(0) to remove stale metric series Avoids cardinality explosion from accumulated stale series when client names are dynamic.	2026-03-30 18:51:38 -07:00
Chris Lu	ced2236cc6	Adjust rename events metadata format (#8854 ) * rename metadata events * fix subscription filter to use NewEntry.Name for rename path matching The server-side subscription filter constructed the new path using OldEntry.Name instead of NewEntry.Name when checking if a rename event's destination matches the subscriber's path prefix. This could cause events to be incorrectly filtered when a rename changes the file name. * fix bucket events to handle rename of bucket directories onBucketEvents only checked IsCreate and IsDelete. A bucket directory rename via AtomicRenameEntry now emits a single rename event (both OldEntry and NewEntry non-nil), which matched neither check. Handle IsRename by deleting the old bucket and creating the new one. * fix replicator to handle rename events across directory boundaries Two issues fixed: 1. The replicator filtered events by checking if the key (old path) was under the source directory. Rename events now use the old path as key, so renames from outside into the watched directory were silently dropped. Now both old and new paths are checked, and cross-boundary renames are converted to create or delete. 2. NewParentPath was passed to the sink without remapping to the sink's target directory structure, causing the sink to write entries at the wrong location. Now NewParentPath is remapped alongside the key. * fix filer sync to handle rename events crossing directory boundaries The early directory-prefix filter only checked resp.Directory (old parent). Rename events now carry the old parent as Directory, so renames from outside the source path into it were dropped before reaching the existing cross-boundary handling logic. Check both old and new directories against sourcePath and excludePaths so the downstream old-key/new-key logic can properly convert these to create or delete operations. 
* fix metadata event path matching * fix metadata event consumers for rename targets * Fix replication rename target keys Logical rename events now reach replication sinks with distinct source and target paths. Handle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too. Add focused tests covering non-filer renames, filer rename updates, and the fallback path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix filer sync rename path scoping Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync. Also apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates. Add focused tests for boundary matching and rename classification. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix replicator directory boundary checks Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree. This prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix. Add focused tests for boundary matching and rename classification across sibling/excluded directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix etc-remote rename-out handling Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote. This prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths. Add focused tests for update/removal classification and sibling-prefix handling. Co-authored-by: Copilot 
<223556219+Copilot@users.noreply.github.com> * Defer rename events until commit Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully. This prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit. Also serialize notification.Queue swaps in rename tests and add failure-path coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Skip descendant rename target lookups Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent. The recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten rename review tests Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix HardLinkId format verb in InsertEntryKnownAbsent error HardLinkId is a byte slice. %d prints each byte as a decimal number which is not useful for an identifier. Use %x to match the log line two lines above. * only skip descendant target lookup when source and dest use same store moveFolderSubEntries unconditionally passed skipTargetLookup=true for every descendant. This is safe when all paths resolve to the same underlying store, but with path-specific store configuration a child's destination may map to a different backend that already holds an entry at that path. Use FilerStoreWrapper.SameActualStore to check per-child and fall back to the full CreateEntry path when stores differ. 
* add nil and create edge-case tests for metadata event scope helpers * extract pathIsEqualOrUnder into util.IsEqualOrUnder Identical implementations existed in both replication/replicator.go and command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the existing FullPath.IsUnder) and remove the duplicates. * use MetadataEventTargetDirectory for new-side directory in filer sync The new-side directory checks and sourceNewKey computation used message.NewParentPath directly. If NewParentPath were empty (legacy events, older filer versions during rolling upgrades), sourceNewKey would be wrong (/filename instead of /dir/filename) and the UpdateEntry parent path rewrite would panic on slice bounds. Derive targetDir once from MetadataEventTargetDirectory, which falls back to resp.Directory when NewParentPath is empty, and use it consistently for all new-side checks and the sink parent path.	2026-03-30 18:25:11 -07:00
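The directory-boundary matching that this commit substitutes for raw prefix checks can be illustrated with a small sketch. The function below is a hypothetical reimplementation written for this example; it is not the source of `util.IsEqualOrUnder`, and its treatment of an empty or root directory is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isEqualOrUnder reports whether path is dir itself or lies inside dir's subtree.
// Matching on dir+"/" instead of the bare prefix keeps siblings such as
// /foo and /foobar apart.
func isEqualOrUnder(path, dir string) bool {
	dir = strings.TrimSuffix(dir, "/")
	if dir == "" {
		return true // assumption: "" and "/" both mean the root, which contains everything
	}
	return path == dir || strings.HasPrefix(path, dir+"/")
}

func main() {
	fmt.Println(isEqualOrUnder("/foo", "/foo"))        // same directory
	fmt.Println(isEqualOrUnder("/foo/bar.txt", "/foo")) // inside the subtree
	fmt.Println(isEqualOrUnder("/foobar", "/foo"))      // sibling, not inside
	// A raw prefix check gets the sibling case wrong:
	fmt.Println(strings.HasPrefix("/foobar", "/foo"))
}
```

With a helper of this shape, a rename can be classified by testing the old and the new path separately: both inside the watched tree is an update, only the new path inside is a create, and only the old path inside is a delete.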
Chris Lu	e5ad5e8d4a	fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume (#8836 ) * fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume The gRPC AssignVolume path was applying the filer's default DiskType to the request before calling detectStorageOption. This caused the default to shadow any disk type configured via a filer location-prefix rule, diverging from the HTTP write path which applies the default only when no rule matches. Extract resolveAssignStorageOption to apply the filer default disk type after detectStorageOption, so location-prefix rules take precedence. * fix(filer): apply default disk type after location-prefix resolution in TUS upload path Same class of bug as the gRPC AssignVolume fix: the TUS tusWriteData handler called detectStorageOption0 but never applied the filer's default DiskType when no location-prefix rule matched. This made TUS uploads ignore the -disk flag entirely.	2026-03-29 14:18:24 -07:00
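The ordering issue fixed here can be shown with a minimal sketch: the filer default must be consulted only after the location-prefix rules have had a chance to supply a disk type. All names below (`rule`, `resolveDiskType`, `firstNonEmpty`) are hypothetical, and the assumption that an explicit per-request disk type outranks a matching rule is mine rather than something the commit message states.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical location-prefix rule carrying a disk type.
type rule struct {
	prefix   string
	diskType string
}

// firstNonEmpty returns the first non-empty string, or "" if all are empty.
func firstNonEmpty(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// resolveDiskType applies the filer default last, so a matching rule is not shadowed.
// The buggy ordering filled in the default before rule lookup, which made the
// request look as if it had already chosen a disk type.
func resolveDiskType(rules []rule, path, requested, filerDefault string) string {
	fromRule := ""
	for _, r := range rules {
		if strings.HasPrefix(path, r.prefix) {
			fromRule = r.diskType
			break
		}
	}
	return firstNonEmpty(requested, fromRule, filerDefault)
}

func main() {
	rules := []rule{{prefix: "/buckets/fast/", diskType: "ssd"}}
	fmt.Println(resolveDiskType(rules, "/buckets/fast/a.bin", "", "hdd")) // rule applies
	fmt.Println(resolveDiskType(rules, "/buckets/cold/a.bin", "", "hdd")) // default applies
}
```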
Chris Lu	c2c58419b8	filer.sync: send log file chunk fids to clients for direct volume server reads (#8792 ) * filer.sync: send log file chunk fids to clients for direct volume server reads Instead of the server reading persisted log files from volume servers, parsing entries, and streaming them over gRPC (serial bottleneck), clients that opt in via client_supports_metadata_chunks receive log file chunk references (fids) and read directly from volume servers in parallel. New proto messages: - LogFileChunkRef: chunk fids + timestamp + filer ID for one log file - SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in - SubscribeMetadataResponse.log_file_refs: server sends refs during backlog Server changes: - CollectLogFileRefs: lists log files and returns chunk refs without any volume server I/O (metadata-only operation) - SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs during persisted log phase, then falls back to normal streaming for in-memory events Client changes: - ReadLogFileRefs: reads log files from volume servers, parses entries, filters by path prefix, invokes processEventFn - MetadataFollowOption.LogFileReaderFn: factory for chunk readers, enables metadata chunks when non-nil - Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate refs then process them at the disk→memory transition Backward compatible: old clients don't set the flag, get existing behavior. Ref: #8771 * filer.sync: merge entries across filers in timestamp order on client side ReadLogFileRefs now groups refs by filer ID and merges entries from multiple filers using a min-heap priority queue — the same algorithm the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue. This ensures events are processed in correct timestamp order even when log files from different filers have interleaved timestamps. Single-filer case takes the fast path (no heap allocation). 
* filer.sync: integration tests for direct-read metadata chunks Three test categories: 1. Merge correctness (TestReadLogFileRefsMergeOrder): Verifies entries from 3 filers are delivered in strict timestamp order, matching the server-side OrderedLogVisitor guarantee. 2. Path filtering (TestReadLogFileRefsPathFilter): Verifies client-side path prefix filtering works correctly. 3. Throughput comparison (TestDirectReadVsServerSideThroughput): 3 filers × 7 files × 300 events = 6300 events, 2ms per file read: server-side: 6300 events 218ms 28,873 events/sec direct-read: 6300 events 51ms 123,566 events/sec (4.3x) parallel: 6300 events 17ms 378,628 events/sec (13.1x) Direct-read eliminates gRPC send overhead per event (4.3x). Parallel per-filer reading eliminates serial file I/O (13.1x). * filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs ReadLogFileRefs now has two levels of I/O overlap: 1. Cross-filer parallelism: one goroutine per filer reads its files concurrently. Entries feed into per-filer channels, merged by the main goroutine via min-heap (same ordering guarantee as the server's OrderedLogVisitor). 2. Within-filer prefetching: while the current file's entries are being consumed by the merge heap, the next file is already being read from the volume server in a background goroutine. Single-filer fast path avoids the heap and channels. Test results (3 filers × 7 files × 300 events, 2ms per file read): server-side sequential: 6300 events 212ms 29,760 events/sec parallel + prefetch: 6300 events 36ms 177,443 events/sec Speedup: 6.0x * filer.sync: address all review comments on metadata chunks PR Critical fixes: - sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream. Ref messages have TsNs=0 and were being incorrectly batched into the Events field by the adaptive batching logic, corrupting ref delivery. 
- readLogFileEntries: use io.ReadFull instead of reader.Read to prevent partial reads from corrupting size values or protobuf data. - Error handling: only skip chunk-not-found errors (matching server-side isChunkNotFoundError). Other I/O or decode failures are propagated so the follower can retry. High-priority fixes: - CollectLogFileRefs: remove incorrect +24h padding from stopTime. The extra day caused unnecessary log file refs to be collected. - Path filtering: ReadLogFileRefs now accepts PathFilter struct with PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses util.Join for path construction (avoids "//foo" on root). Excludes /.system/log/ internal entries. Matches server-side eachEventNotificationFn filtering logic. Medium-priority fixes: - CollectLogFileRefs: accept context.Context, propagate to ListDirectoryEntries calls for cancellation support. - NewChunkStreamReaderFromLookup: accept context.Context, propagate to doNewChunkStreamReader. Test fixes: - Check error returns from ReadLogFileRefs in all test call sites. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-27 11:01:29 -07:00
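The client-side merge described in this commit (entries from several filers delivered in timestamp order through a min-heap, with a fast path for a single filer) follows the usual k-way merge pattern. The sketch below uses hypothetical types (`entry`, `cursor`) and in-memory slices in place of the per-filer channels and volume-server reads, so it shows only the ordering logic, not `ReadLogFileRefs` itself.

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry is a hypothetical log entry: a timestamp plus the filer it came from.
type entry struct {
	tsNs  int64
	filer string
}

// cursor walks one filer's entries, which are assumed to be sorted by tsNs.
type cursor struct {
	entries []entry
	pos     int
}

// minHeap orders cursors by the timestamp of their next unread entry.
type minHeap []*cursor

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].entries[h[i].pos].tsNs < h[j].entries[h[j].pos].tsNs }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *minHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeByTimestamp merges per-filer entry lists into one timestamp-ordered list.
func mergeByTimestamp(perFiler [][]entry) []entry {
	if len(perFiler) == 1 {
		return perFiler[0] // single-filer fast path: no heap needed
	}
	h := &minHeap{}
	for _, es := range perFiler {
		if len(es) > 0 {
			*h = append(*h, &cursor{entries: es})
		}
	}
	heap.Init(h)
	var out []entry
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the globally smallest timestamp
		out = append(out, c.entries[c.pos])
		c.pos++
		if c.pos < len(c.entries) {
			heap.Fix(h, 0) // its next entry may belong further down the heap
		} else {
			heap.Pop(h) // this filer is exhausted
		}
	}
	return out
}

func main() {
	merged := mergeByTimestamp([][]entry{
		{{10, "filerA"}, {40, "filerA"}},
		{{20, "filerB"}, {30, "filerB"}},
	})
	for _, e := range merged {
		fmt.Println(e.tsNs, e.filer)
	}
}
```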
Chris Lu	d97660d0cd	filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791 ) * filer.sync: pipelined subscription with adaptive batching for faster catch-up The SubscribeMetadata pipeline was fully serial: reading a log entry from a volume server, unmarshaling, filtering, and calling stream.Send() all happened one-at-a-time. stream.Send() blocked the entire pipeline until the client acknowledged each event, limiting throughput to ~80 events/sec regardless of the -concurrency setting. Four server-side optimizations that stack: 1. Pipelined sender: decouple stream.Send() from the read loop via a buffered channel (1024 messages). A dedicated goroutine handles gRPC delivery while the reader continues processing the next events. 2. Adaptive batching: when event timestamps are >2min behind wall clock (backlog catch-up), drain multiple events from the channel and pack them into a single stream.Send() using a new `repeated events` field on SubscribeMetadataResponse. When events are recent (real-time), send one-by-one for low latency. Old clients ignore the new field (backward compatible). 3. Persisted log readahead: run the OrderedLogVisitor in a background goroutine so volume server I/O for the next log file overlaps with event processing and gRPC delivery. 4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling in SubscribeMetadata with notification-driven wake-up using the MetaLogBuffer subscriber mechanism, reducing real-time latency from ~1127ms to sub-millisecond. 
Combined, these create a 3-stage pipeline: [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send] Test results (simulated backlog with 50µs gRPC latency per Send): direct (old): 2100 events 2100 sends 168ms 12,512 events/sec pipelined+batched: 2100 events 14 sends 40ms 52,856 events/sec Speedup: 4.2x single-stream throughput Ref: #8771 * filer.sync: require client opt-in for batch event delivery Add ClientSupportsBatching field to SubscribeMetadataRequest. The server only packs events into the Events batch field when the client explicitly sets this flag to true. Old clients (Java SDK, third-party) that don't set the flag get one-event-per-Send, preserving backward compatibility. All Go callers (FollowMetadata, MetaAggregator) set the flag to true since their recv loops already unpack batched events. * filer.sync: clear batch Events field after Send to release references Prevents the envelope message from holding references to the rest of the batch after gRPC serialization, allowing the GC to collect them sooner. * filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe - pipelinedSender.Send: add case <-s.done to unblock when sender goroutine exits (fixes deadlock when errCh was already consumed by a prior Send). - pipelinedSender.reportErr: remove for-range drain on sendCh that could block indefinitely. Send() now detects exit via s.done instead. - SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the gap-detected-no-memory-data path with event-driven listenersCond.Wait(), consistent with the rest of the subscription paths. - Add TestPipelinedSenderErrorPropagation: verifies error surfaces via Send and Close when the underlying stream fails. - Replace goto with labeled break in test simulatePipeline. 
* filer.sync: check error returns in test code - direct_send: check slowStream.Send error return - pipelined_batched_send: check sender.Close error return - simulatePipeline: return error from sender.Close, propagate to callers --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-26 23:55:42 -07:00
Chris Lu	3a3fff1399	Fix TUS chunked upload and resume failures (#8783 ) (#8786 ) * Fix TUS chunked upload and resume failures caused by request context cancellation (#8783) The filer's TCP connections use a 10-second inactivity timeout (net_timeout.go). After the TUS PATCH request body is fully consumed, internal operations (assigning file IDs via gRPC to the master, uploading data to volume servers, completing uploads) do not generate any activity on the client connection, so the inactivity timer fires and Go's HTTP server cancels the request context. This caused HTTP 500 errors on PATCH requests where body reading + internal processing exceeded the timeout. Fix by using context.WithoutCancel in TUS create and patch handlers, matching the existing pattern used by assignNewFileInfo. This ensures internal operations complete regardless of client connection state. Fixes seaweedfs/seaweedfs#8783 * Add comment to tusCreateHandler explaining context.WithoutCancel rationale * Run TUS integration tests on all PRs, not just TUS file changes The previous path filter meant these tests only ran when TUS-specific files changed. This allowed regressions from changes to shared infrastructure (net_timeout.go, upload paths, gRPC) to go undetected — which is exactly how the context cancellation bug in #8783 was missed. Matches the pattern used by s3-go-tests.yml.	2026-03-26 14:06:21 -07:00
Lisandro Pin	e5cf2d2a19	Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. (#8360 ) * Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. Also exposes this option in the shell `volume.scrub` command. * Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`. `417051bb` slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`, so calling them for invalid volume IDs will actually yield that error, instead of checking maintenance mode first.	2026-03-26 10:20:57 -07:00
Chris Lu	94bfa2b340	mount: stream all filer mutations over single ordered gRPC stream (#8770 ) * filer: add StreamMutateEntry bidi streaming RPC Add a bidirectional streaming RPC that carries all filer mutation types (create, update, delete, rename) over a single ordered stream. This eliminates per-request connection overhead for pipelined operations and guarantees mutation ordering within a stream. The server handler delegates each request to the existing unary handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy stream adapter for rename operations to reuse StreamRenameEntry logic. The is_last field signals completion for multi-response operations (rename sends multiple events per request; create/update/delete always send exactly one response with is_last=true). * mount: add streaming mutation multiplexer (streamMutateMux) Implement a client-side multiplexer that routes all filer mutation RPCs (create, update, delete, rename) over a single bidirectional gRPC stream. Multiple goroutines submit requests through a send channel; a dedicated sendLoop serializes them on the stream; a recvLoop dispatches responses to waiting callers via per-request channels. Key features: - Lazy stream opening on first use - Automatic reconnection on stream failure - Permanent fallback to unary RPCs if filer returns Unimplemented - Monotonic request_id for response correlation - Multi-response support for rename operations (is_last signaling) The mux is initialized on WFS and closed during unmount cleanup. No call sites use it yet — wiring comes in subsequent commits. * mount: route CreateEntry and UpdateEntry through streaming mux Wire all CreateEntry call sites to use wfs.streamCreateEntry() which routes through the StreamMutateEntry stream when available, falling back to unary RPCs otherwise. Also wire Link's UpdateEntry calls through wfs.streamUpdateEntry(). 
Updated call sites: - flushMetadataToFiler (file flush after write) - Mkdir (directory creation) - Symlink (symbolic link creation) - createRegularFile non-deferred path (Mknod) - flushFileMetadata (periodic metadata flush) - Link (hard link: update source + create link + rollback) * mount: route UpdateEntry and DeleteEntry through streaming mux Wire remaining mutation call sites through the streaming mux: - saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry - Unlink → streamDeleteEntry (replaces RemoveWithResponse) - Rmdir → streamDeleteEntry (replaces RemoveWithResponse) All filer mutations except Rename now go through StreamMutateEntry when the filer supports it, with automatic unary RPC fallback. * mount: route Rename through streaming mux Wire Rename to use streamMutate.Rename() when available, with fallback to the existing StreamRenameEntry unary stream. The streaming mux sends rename as a StreamRenameEntryRequest oneof variant. The server processes it through the existing rename logic and sends multiple StreamRenameEntryResponse events (one per moved entry), with is_last=true on the final response. All filer mutations now go through a single ordered stream. * mount: fix stream mux connection ownership WithGrpcClient(streamingMode=true) closes the gRPC connection when the callback returns, destroying the stream. Own the connection directly via pb.GrpcDial so it stays alive for the stream's lifetime. Close it explicitly in recvLoop on stream failure and in Close on shutdown. * mount: fix rename failure for deferred-create files Three fixes for rename operations over the streaming mux: 1. lookupEntry: fall back to local metadata store when filer returns "not found" for entries in uncached directories. Files created with deferFilerCreate=true exist only in the local leveldb store until flushed; lookupEntry skipped the local store when the parent directory had never been readdir'd, causing rename to fail with ENOENT. 2. 
Rename: wait for pending async flushes and force synchronous flush of dirty metadata before sending rename to the filer. Covers the writebackCache case where close() defers the flush to a background worker that may not complete before rename fires. 3. StreamMutateEntry: propagate rename errors from server to client. Add error/errno fields to StreamMutateEntryResponse so the mount can map filer errors to correct FUSE status codes instead of silently returning OK. Also fix the existing Rename error handler which could return fuse.OK on unrecognized errors. * mount: fix streaming mux error handling, sendLoop lifecycle, and fallback Address PR review comments: 1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for create/update/delete errors, not just rename. Previously update errors were silently dropped and create/delete errors were only in nested response fields that the client didn't check. 2. Client: check nested error fields in CreateEntry (ErrorCode, Error) and DeleteEntry (Error) responses, matching CreateEntryWithResponse behavior. 3. Fix sendLoop lifecycle: give each stream generation a stopSend channel. recvLoop closes it on error to stop the paired sendLoop. Previously a reconnect left the old sendLoop draining sendCh, breaking ordering. 4. Transparent fallback: stream helpers and doRename fall back to unary RPCs on transport errors (ErrStreamTransport), including the first Unimplemented from ensureStream. Previously the first call failed instead of degrading. 5. Filer rotation in openStream: try all filer addresses on dial failure, matching WithFilerClient behavior. Stop early on Unimplemented. 6. Pass metadata-bearing context to StreamMutateEntry RPC call so sw-client-id header is actually sent. 7. Gate lookupEntry local-cache fallback on open dirty handle or pending async flush to avoid resurrecting deleted/renamed entries. 8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil). 9. 
Use string matching for rename error-to-errno mapping in the mount to stay portable across Linux/macOS (numeric errno values differ). * mount: make failAllPending idempotent with delete-before-close Change failAllPending to collect pending entries into a local slice (deleting from the sync.Map first) before closing channels. This prevents double-close panics if called concurrently. Also remove the unused err parameter. * mount: add stream generation tracking and teardownStream Introduce a generation counter on streamMutateMux that increments each time a new stream is created. Requests carry the generation they were enqueued for so sendLoop can reject stale requests after reconnect. Add teardownStream(gen) which is idempotent (only acts when gen matches current generation and stream is non-nil). Both sendLoop and recvLoop call it on error, replacing the inline cleanup in recvLoop. sendLoop now actively triggers teardown on send errors instead of silently exiting. ensureStream waits for the prior generation's recvDone before creating a new stream, ensuring all old pending waiters are failed before reconnect. recvLoop now takes the stream, generation, and recvDone channel as parameters to avoid accessing shared fields without the lock. * mount: harden Close to prevent races with teardownStream Nil out stream, cancel, and grpcConn under the lock so that any concurrent teardownStream call from recvLoop/sendLoop becomes a no-op. Call failAllPending before closing sendCh to unblock waiters promptly. Guard recvDone with a nil check for the case where Close is called before any stream was ever opened. * mount: make errCh receive ctx-aware in doUnary and Rename Replace the blocking <-sendReq.errCh with a select that also observes ctx.Done(). If sendLoop exits via stopSend without consuming a buffered request, the caller now returns ctx.Err() instead of blocking forever. The buffered errCh (capacity 1) ensures late acknowledgements from sendLoop don't block the sender. 
* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race Three related fixes: 1. Stop closing sendCh in Close(). Closing the shared producer channel races with callers who passed ensureStream() but haven't sent yet, causing send-on-closed-channel panics. sendCh is now left open; ensureStream checks m.closed to reject new callers. 2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh() on exit so buffered requests get an ErrStreamTransport on their errCh instead of blocking forever. Close() drains again for any stragglers enqueued between sendLoop's drain and the final shutdown. 3. Move failAllPending from teardownStream into recvLoop's defer. teardownStream (called from sendLoop on send error) was closing pending response channels while recvLoop could be between pending.Load and the channel send — a send-on-closed-channel panic. recvLoop is now the sole closer of pending channels, eliminating the race. Close() waits on recvDone (with cancel() to guarantee Recv unblocks) so pending cleanup always completes. * filer/mount: add debug logging for hardlink lifecycle Add V(0) logging at every point where a HardLinkId is created, stored, read, or deleted to trace orphaned hardlink references. Logging covers: - gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId - FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId - handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count - setHardLink: KvPut with blob size - maybeReadHardLink: V(1) on read attempt and successful decode - DeleteHardLink: counter decrement/deletion events - Mount Link(): when NewHardLinkId is generated and link is created This helps diagnose how a git pack .rev file ended up with a HardLinkId during a clone (no hard links should be involved). * test: add git clone/pull integration test for FUSE mount Shell script that exercises git operations on a SeaweedFS mount: 1. Creates a bare repo on the mount 2. 
Clones locally, makes 3 commits, pushes to mount 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (files, content, commit hashes) 5. Pushes 2 more commits with renames and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real changes 8. Verifies file content, renames, deletes after pull 9. Checks git log integrity and clean status 27 assertions covering file existence, content, commit hashes, file counts, renames, deletes, and git status. Run against any existing mount: bash test-git-on-mount.sh /path/to/mount * test: add git clone/pull FUSE integration test to CI suite Add TestGitOperations to the existing fuse_integration test framework. The test exercises git's full file operation surface on the mount: 1. Creates a bare repo on the mount (acts as remote) 2. Clones locally, makes 3 commits (files, bulk data, renames), pushes 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (content, commit hash, file count) 5. Pushes 2 more commits with new files, renames, and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real fast-forward changes 8. Verifies post-pull state: content, renames, deletes, file counts 9. Checks git log integrity (5 commits) and clean status Runs automatically in the existing fuse-integration.yml CI workflow. * mount: fix permission check with uid/gid mapping The permission checks in createRegularFile() and Access() compared the caller's local uid/gid against the entry's filer-side uid/gid without applying the uid/gid mapper. With -map.uid 501:0, a directory created as uid 0 on the filer would not match the local caller uid 501, causing hasAccess() to fall through to "other" permission bits and reject write access (0755 → other has r-x, no w). Fix: map entry uid/gid from filer-space to local-space before the hasAccess() call so both sides are in the same namespace. 
This fixes rsync -a failing with "Permission denied" on mkstempat when using uid/gid mapping. * mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to restore local uid/gid, but `outputPbEntry` writes the kernel response before the function returns — so the kernel received filer-side uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent child operations (mkdir, create) because the caller uid (501) doesn't match the directory owner (0), and "other" bits (0755 → r-x) lack write permission. Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal before outputPbEntry, so the kernel gets local uid/gid. Also add nil guards for UidGidMapper in Access and createRegularFile to prevent panics in tests that don't configure a mapper. This fixes rsync -a "Permission denied" on mkpathat for nested directories when using uid/gid mapping. * mount: fix Link outputting filer-side uid/gid to kernel, add nil guards Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink: the kernel received filer-side uid/gid because the defer hadn't run yet when outputPbEntry wrote the response. Also add nil guards for UidGidMapper in Access and createRegularFile so tests without a mapper don't panic. 
Audit of all outputPbEntry/outputFilerEntry call sites: - Mkdir: fixed in prior commit (explicit map before output) - Symlink: fixed in prior commit (explicit map before output) - Link: fixed here (explicit map before output) - Create (existing file): entry from maybeLoadEntry (already mapped) - Create (deferred): entry has local uid/gid (never mapped to filer) - Create (non-deferred): createRegularFile defer runs before return - Mknod: createRegularFile defer runs before return - Lookup: entry from lookupEntry (already mapped) - GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped) - readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped) - saveEntry: no kernel output - flushMetadataToFiler: no kernel output - flushFileMetadata: no kernel output * test: fix git test for same-filesystem FUSE clone When both the bare repo and working clone live on the same FUSE mount, git's local transport uses hardlinks and cross-repo stat calls that fail on FUSE. Fix: - Use --no-local on clone to disable local transport optimizations - Use reset --hard instead of checkout to stay on branch - Use fetch + reset --hard origin/<branch> instead of git pull to avoid local transport stat failures during fetch * adjust logging * test: use plain git clone/pull to exercise real FUSE behavior Remove --no-local and fetch+reset workarounds. The test should use the same git commands users run (clone, reset --hard, pull) so it reveals real FUSE issues rather than hiding them. 
* test: enable V(1) logging for filer/mount and collect logs on failure - Run filer and mount with -v=1 so hardlink lifecycle logs (V(0): create/delete/insert, V(1): read attempts) are captured - On test failure, automatically dump last 16KB of all process logs (master, volume, filer, mount) to test output - Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload - Update CI workflow to upload SeaweedFS process logs alongside test output * mount: clone entry for filer flush to prevent uid/gid race flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which returns the file handle's live proto entry pointer, then mutated it in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable) could observe filer-side uid/gid (e.g., 0:0) on the file handle entry and return it to the kernel. The kernel caches these attributes, so subsequent opens by the local user (uid 501) fail with EACCES. Fix: proto.Clone the entry before mapping uid/gid for the filer request. The file handle's live entry is never mutated, so concurrent Lookup always sees local uid/gid. This fixes the intermittent "Permission denied" on .git/FETCH_HEAD after the first git pull on a mount with uid/gid mapping. * mount: add debug logging for stale lock file investigation Add V(0) logging to trace the HEAD.lock recreation issue: - Create: log when O_EXCL fails (file already exists) with uid/gid/mode - completeAsyncFlush: log resolved path, saved path, dirtyMetadata, isDeleted at entry to trace whether async flush fires after rename - flushMetadataToFiler: log the dir/name/fullpath being flushed This will show whether the async flush is recreating the lock file after git renames HEAD.lock → HEAD. 
* mount: prevent async flush from recreating renamed .lock files When git renames HEAD.lock → HEAD, the async flush from the prior close() can run AFTER the rename and re-insert HEAD.lock into the meta cache via its CreateEntryRequest response event. The next git pull then sees HEAD.lock and fails with "File exists". Fix: add isRenamed flag on FileHandle, set by Rename before waiting for the pending async flush. The async flush checks this flag and skips the metadata flush for renamed files (same pattern as isDeleted for unlinked files). The data pages still flush normally. The Rename handler flushes deferred metadata synchronously (Case 1) before setting isRenamed, ensuring the entry exists on the filer for the rename to proceed. For already-released handles (Case 2), the entry was created by a prior flush. * mount: also mark renamed inodes via entry.Attributes.Inode fallback When GetInode fails (Forget already removed the inode mapping), the Rename handler couldn't find the pending async flush to set isRenamed. The async flush then recreated the .lock file on the filer. Fix: fall back to oldEntry.Attributes.Inode to find the pending async flush when the inode-to-path mapping is gone. Also extract MarkInodeRenamed into a method on FileHandleToInode for clarity. * mount: skip async metadata flush when saved path no longer maps to inode The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock because neither GetInode nor oldEntry.Attributes.Inode could find the inode (Forget already evicted the mapping, and the entry's stored inode was 0). Add a direct check in completeAsyncFlush: before flushing metadata, verify that the saved path still maps to this inode in the inode-to-path table. If the path was renamed or removed (inode mismatch or not found), skip the metadata flush to avoid recreating a stale entry. This catches all rename cases regardless of whether the Rename handler could set the isRenamed flag. 
* mount: wait for pending async flush in Unlink before filer delete Unlink was deleting the filer entry first, then marking the draining async-flush handle as deleted. The async flush worker could race between these two operations and recreate the just-unlinked entry on the filer. This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to persist after git pull, breaking subsequent git operations. Move the isDeleted marking and add waitForPendingAsyncFlush() before the filer delete so any in-flight flush completes first. Even if the worker raced past the isDeleted check, the wait ensures it finishes before the filer delete cleans up any recreated entry. * mount: reduce async flush and metadata flush log verbosity Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for every file close with writebackCache and are too noisy for normal use. * filer: reduce hardlink debug log verbosity from V(0) to V(4) HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and filer_grpc_server fire on every hardlinked file operation (git pack files use hardlinks extensively) and produce excessive noise. * mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check - weedfs_link.go: hardlink creation logs V(0) → V(4) - weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1) - empty_folder_cleaner.go: "not empty" check log V(0) → V(4) * filer: handle missing hardlink KV as expected, not error A "kv: not found" on hardlink read is normal when the link blob was already cleaned up but a stale entry still references it. Log at V(1) for not-found; keep Error level for actual KV failures. * test: add waitForDir before git pull in FUSE git operations test After git reset --hard, the FUSE mount's metadata cache may need a moment to settle on slow CI. The git pull subprocess (unpack-objects) could fail to stat the working directory. Poll for up to 5s. 
* Update git_operations_test.go * wait * test: simplify FUSE test framework to use weed mini Replace the 4-process setup (master + volume + filer + mount) with 2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies startup, reduces port allocation, and is faster on CI. * test: fix mini flag -admin → -admin.ui	2026-03-25 20:06:34 -07:00
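The defer-before-output bug fixed in Mkdir, Symlink, and Link in the commit above comes down to when Go runs deferred calls relative to evaluating the return expression. The following is a minimal sketch, not the mount code itself: the `entry` type, the 0 to 501 uid mapping, and the helper names are hypothetical stand-ins chosen for illustration.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for the proto entry; only the uid matters here.
type entry struct{ uid uint32 }

// mapFilerToLocal is a hypothetical mapper: filer-side uid 0 becomes local uid 501.
func mapFilerToLocal(e *entry) {
	if e.uid == 0 {
		e.uid = 501
	}
}

// replyToKernel stands in for the step that writes the kernel response:
// it returns the uid the kernel would see and cache.
func replyToKernel(e *entry) uint32 { return e.uid }

// mkdirBuggy mirrors the old pattern. The return expression is evaluated
// before the deferred mapper runs, so the reply carries the filer-side uid.
func mkdirBuggy(e *entry) uint32 {
	defer mapFilerToLocal(e)
	return replyToKernel(e)
}

// mkdirFixed maps explicitly before producing the reply.
func mkdirFixed(e *entry) uint32 {
	mapFilerToLocal(e)
	return replyToKernel(e)
}

func main() {
	fmt.Println("buggy reply uid:", mkdirBuggy(&entry{uid: 0}))
	fmt.Println("fixed reply uid:", mkdirFixed(&entry{uid: 0}))
}
```

In this sketch the buggy variant replies with uid 0 and the fixed variant with uid 501, which matches the symptom described above: the kernel caches an owner that does not match the local caller.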
Chris Lu	0b3867dca3	filer: add structured error codes to CreateEntryResponse (#8767 ) * filer: add FilerError enum and error_code field to CreateEntryResponse Add a machine-readable error code alongside the existing string error field. This follows the precedent set by PublishMessageResponse in the MQ broker proto. The string field is kept for human readability and backward compatibility. Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE, EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS. * filer: add sentinel errors and error code mapping in filer_pb Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in the filer_pb package so both the filer and consumers can reference them without circular imports. Add FilerErrorToSentinel() to map proto error codes to sentinels, and update CreateEntryWithResponse() to check error_code first, falling back to the string-based path for backward compatibility with old servers. * filer: return wrapped sentinel errors and set proto error codes Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using %w). This preserves errors.Is() traversal on the server side. In the gRPC CreateEntry handler, map sentinel errors to the corresponding FilerError proto codes using errors.Is(), setting both resp.Error (string, for backward compat) and resp.ErrorCode (enum). * S3: use errors.Is() with filer sentinels instead of string matching Replace fragile string-based error matching in filerErrorToS3Error and other S3 API consumers with errors.Is() checks against filer_pb sentinel errors. This works because the updated CreateEntryWithResponse helper reconstructs sentinel errors from the proto FilerError code. Update iceberg stage_create and metadata_files to check resp.ErrorCode instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is() for the already-exists check. 
String matching is retained only for non-filer errors (gRPC transport errors, checksum validation) that don't go through CreateEntryResponse. * filer: remove backward-compat string fallbacks for error codes Clients and servers are always deployed together, so there is no need for backward-compatibility fallback paths that parse resp.Error strings when resp.ErrorCode is unset. Simplify all consumers to rely solely on the structured error code. * iceberg: ensure unknown non-OK error codes are not silently ignored When FilerErrorToSentinel returns nil for an unrecognized error code, return an error including the code and message rather than falling through to return nil. * filer: fix redundant error message and restore error wrapping in helper Use request path instead of resp.Error in the sentinel error format string to avoid duplicating the sentinel message (e.g. "entry already exists: entry already exists"). Restore %w wrapping with errors.New() in the fallback paths so callers can use errors.Is()/errors.As(). * filer: promote file to directory on path conflict instead of erroring S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object) to coexist because S3 has a flat key space. When ensureParentDirectoryEntry finds a parent path that is a file instead of a directory, promote it to a directory by setting ModeDir while preserving the original content and chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry type-change guard. This fixes the S3 compatibility test failures where creating overlapping keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.	2026-03-24 17:08:22 -07:00
Chris Lu	c31e6b4684	Use filer-side copy for mounted whole-file copy_file_range (#8747) * Optimize mounted whole-file copy_file_range * Address mounted copy review feedback * Harden mounted copy fast path --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-23 18:35:15 -07:00
Chris Lu	6bf654c25c	fix: keep metadata subscriptions progressing (#8730) (#8746) * fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks	2026-03-23 15:26:54 -07:00
Chris Lu	15f4a97029	fix: improve raft leader election reliability and failover speed (#8692 ) * fix: clear raft vote state file on non-resume startup The seaweedfs/raft library v1.1.7 added a persistent `state` file for currentTerm and votedFor. When RaftResumeState=false (the default), the log, conf, and snapshot directories are cleared but this state file was not. On repeated restarts, different masters accumulate divergent terms, causing AppendEntries rejections and preventing leader election. Fixes #8690 * fix: recover TopologyId from snapshot before clearing raft state When RaftResumeState=false clears log/conf/snapshot, the TopologyId (used for license validation) was lost. Now extract it from the latest snapshot before cleanup and restore it on the topology. Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared recoverTopologyIdFromState helper in raft_common.go. * fix: stagger multi-master bootstrap delay by peer index Previously all masters used a fixed 1500ms delay before the bootstrap check. Now the delay is proportional to the peer's sorted index with randomization (matching the hashicorp raft path), giving the designated bootstrap node (peer 0) a head start while later peers wait for gRPC servers to be ready. Also adds diagnostic logging showing why DoJoinCommand was or wasn't called, making leader election issues easier to diagnose from logs. * fix: skip unreachable masters during leader reconnection When a master leader goes down, non-leader masters still redirect clients to the stale leader address. The masterClient would follow these redirects, fail, and retry — wasting round-trips each cycle. Now tryAllMasters tracks which masters failed within a cycle and skips redirects pointing to them, reducing log spam and connection overhead during leader failover. 
* fix: take snapshot after TopologyId generation for recovery After generating a new TopologyId on the leader, immediately take a raft snapshot so the ID can be recovered from the snapshot on future restarts with RaftResumeState=false. Without this, short-lived clusters would lose the TopologyId on restart since no automatic snapshot had been taken yet. * test: add multi-master raft failover integration tests Integration test framework and 5 test scenarios for 3-node master clusters: - TestLeaderConsistencyAcrossNodes: all nodes agree on leader and TopologyId - TestLeaderDownAndRecoverQuickly: leader stops, new leader elected, old leader rejoins as follower - TestLeaderDownSlowRecover: leader gone for extended period, cluster continues with 2/3 quorum - TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered when both restart - TestAllMastersDownAndRestart: full cluster restart, leader elected, all nodes agree on TopologyId * fix: address PR review comments - peerIndex: return -1 (not 0) when self not found, add warning log - recoverTopologyIdFromSnapshot: defer dir.Close() - tests: check GetTopologyId errors instead of discarding them * fix: address review comments on failover tests - Assert no leader after quorum loss (was only logging) - Verify follower cs.Leader matches expected leader via ServerAddress.ToHttpAddress() comparison - Check GetTopologyId error in TestTwoMastersDownAndRestart	2026-03-18 23:28:07 -07:00
Jayshan Raghunandan	1f1eac4f08	feat: improve aio support for admin/volume ingress and fix UI links (#8679 ) * feat: improve allInOne mode support for admin/volume ingress and fix master UI links - Add allInOne support to admin ingress template, matching the pattern used by filer and s3 ingress templates (or-based enablement with ternary service name selection) - Add allInOne support to volume ingress template, which previously required volume.enabled even when the volume server runs within the allInOne pod - Expose admin ports in allInOne deployment and service when allInOne.admin.enabled is set - Add allInOne.admin config section to values.yaml (enabled by default, ports inherit from admin.) - Fix legacy master UI templates (master.html, masterNewRaft.html) to prefer PublicUrl over internal Url when linking to volume server UI. The new admin UI already handles this correctly. fix: revert admin allInOne changes and fix PublicUrl in admin dashboard The admin binary (`weed admin`) is a separate process that cannot run inside `weed server` (allInOne mode). Revert the admin-related allInOne helm chart changes that caused 503 errors on admin ingress. Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to node.Id (internal pod address) instead of the actual public URL. Add public_url field to DataNodeInfo proto message so the topology gRPC response carries the public URL set via -volume.publicUrl flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use HTTP /dir/status to populate PublicUrl in admin dashboard The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL. Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it in both GetClusterTopology and GetClusterVolumeServers code paths. Also reverts the unnecessary proto field additions from the previous commit and cleans up a stray blank line in all-in-one-service.yml. 
* fix: apply PublicUrl link fix to masterNewRaft.html Match the same conditional logic already applied to master.html — prefer PublicUrl when set and different from Url. * fix: add HTTP timeout and status check to fetchPublicUrlMap Use a 5s-timeout client instead of http.DefaultClient to prevent blocking indefinitely when the master is unresponsive. Also check the HTTP status code before attempting to parse the response body. * fix: fall back to node address when PublicUrl is empty Prevents blank links in the admin dashboard when PublicUrl is not configured, such as in standalone or mixed-version clusters. * fix: log io.ReadAll error in fetchPublicUrlMap --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-18 13:20:55 -07:00
Chris Lu	81369b8a83	improve: large file sync throughput for remote.cache and filer.sync (#8676 ) * improve large file sync throughput for remote.cache and filer.sync Three main throughput improvements: 1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file instead of always starting at 5MB. A 500MB file now uses ~16MB chunks (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk overhead (volume assign, gRPC call, needle write) by 3x. 2. Configurable concurrency at every layer: - remote.cache chunk concurrency: -chunkConcurrency flag (default 8) - remote.cache S3 download concurrency: -downloadConcurrency flag (default raised from 1 to 5 per chunk) - filer.sync chunk concurrency: -chunkConcurrency flag (default 32) 3. S3 multipart download concurrency raised from 1 to 5: the S3 manager downloader was using Concurrency=1, serializing all part downloads within each chunk. This alone can 5x per-chunk download speed. The concurrency values flow through the gRPC request chain: shell command → CacheRemoteObjectToLocalClusterRequest → FetchAndWriteNeedleRequest → S3 downloader Zero values in the request mean "use server defaults", maintaining full backward compatibility with existing callers. Ref #8481 * fix: use full maxMB for chunk size cap and remove loop guard Address review feedback: - Use full maxMB instead of maxMB/2 for maxChunkSize to avoid unnecessarily limiting chunk size for very large files. - Remove chunkSize < maxChunkSize guard from the safety loop so it can always grow past maxChunkSize when needed to stay under 1000 chunks (e.g., extremely large files with small maxMB). * address review feedback: help text, validation, naming, docs - Fix help text for -chunkConcurrency and -downloadConcurrency flags to say "0 = server default" instead of advertising specific numeric defaults that could drift from the server implementation. 
- Validate chunkConcurrency and downloadConcurrency are within int32 range before narrowing, returning a user-facing error if out of range. - Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions. - Add doc comment to SetChunkConcurrency noting it must be called during initialization before replication goroutines start. - Replace doubling loop in chunk size safety check with direct ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap. * address Copilot review: clamp concurrency, fix chunk count, clarify proto docs - Use ceiling division for chunk count check to avoid overcounting when file size is an exact multiple of chunk size. - Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024 at filer, max 64 at volume server) to prevent excessive goroutines. - Always use ReadFileWithConcurrency when the client supports it, falling back to the implementation's default when value is 0. - Clarify proto comments that download_concurrency only applies when the remote storage client supports it (currently S3). - Include specific server defaults in help text (e.g., "0 = server default 8") so users see the actual values in -h output. * fix data race on executionErr and use %w for error wrapping - Protect concurrent writes to executionErr in remote.cache worker goroutines with a sync.Mutex to eliminate the data race. - Use %w instead of %v in volume_grpc_remote.go error formatting to preserve the error chain for errors.Is/errors.As callers.	2026-03-17 16:49:56 -07:00
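The adaptive chunk sizing described in the commit above (aim for about 32 chunks, cap by the configured maximum, and never exceed 1000 chunks) can be sketched as below. The function and constant names are hypothetical, and the 5 MiB floor is an assumption based on the old fixed starting size; this is not the remote.cache implementation.

```go
package main

import "fmt"

const (
	minChunkSize  int64 = 5 << 20 // assumed floor: the previous fixed 5MB starting size
	targetChunks  int64 = 32      // aim for roughly this many chunks per file
	maxChunkCount int64 = 1000    // hard cap on chunks per file
)

// ceilDiv is integer division rounded up, so an exact multiple is not overcounted.
func ceilDiv(a, b int64) int64 { return (a + b - 1) / b }

// pickChunkSize chooses a chunk size for a file of fileSize bytes.
// maxChunkSize plays the role of the configured maxMB limit, in bytes.
func pickChunkSize(fileSize, maxChunkSize int64) int64 {
	size := ceilDiv(fileSize, targetChunks)
	if size < minChunkSize {
		size = minChunkSize
	}
	if size > maxChunkSize {
		size = maxChunkSize
	}
	// Safety net: compute the size directly instead of doubling in a loop,
	// even if that means growing past maxChunkSize.
	if ceilDiv(fileSize, size) > maxChunkCount {
		size = ceilDiv(fileSize, maxChunkCount)
	}
	return size
}

func main() {
	for _, fileSize := range []int64{1 << 20, 500 << 20, 100 << 30} {
		size := pickChunkSize(fileSize, 64<<20)
		fmt.Printf("file=%d bytes chunk=%d bytes chunks=%d\n", fileSize, size, ceilDiv(fileSize, size))
	}
}
```

With a 64 MiB cap, a 500 MiB file lands on 32 chunks of roughly 16 MB each, in line with the example given in the commit message.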
Chris Lu	f4073107cb	fix: clean up orphaned needles on remote.cache partial download failure (#8675 ) When remote.cache downloads a file in parallel chunks and a gRPC connection drops mid-transfer, chunks already written to volume servers were not cleaned up. Since the filer metadata was never updated, these needles became orphaned — invisible to volume.vacuum and never referenced by the filer. On subsequent cache cycles the file was still treated as uncached, creating more orphans each attempt. Call DeleteUncommittedChunks on the download-error path, matching the cleanup already present for the metadata-update-failure path. Fixes #8481	2026-03-17 13:47:54 -07:00
Chris Lu	acea36a181	filer: add conditional update preconditions (#8647) * filer: add conditional update preconditions * iceberg: tighten metadata CAS preconditions	2026-03-16 12:33:32 -07:00
Chris Lu	8cde3d4486	Add data file compaction to iceberg maintenance (Phase 2) (#8503 ) * Add iceberg_maintenance plugin worker handler (Phase 1) Implement automated Iceberg table maintenance as a new plugin worker job type. The handler scans S3 table buckets for tables needing maintenance and executes operations in the correct Iceberg order: expire snapshots, remove orphan files, and rewrite manifests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add data file compaction to iceberg maintenance handler (Phase 2) Implement bin-packing compaction for small Parquet data files: - Enumerate data files from manifests, group by partition - Merge small files using parquet-go (read rows, write merged output) - Create new manifest with ADDED/DELETED/EXISTING entries - Commit new snapshot with compaction metadata Add 'compact' operation to maintenance order (runs before expire_snapshots), configurable via target_file_size_bytes and min_input_files thresholds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix memory exhaustion in mergeParquetFiles by processing files sequentially Previously all source Parquet files were loaded into memory simultaneously, risking OOM when a compaction bin contained many small files. Now each file is loaded, its rows are streamed into the output writer, and its data is released before the next file is loaded — keeping peak memory proportional to one input file plus the output buffer. * Validate bucket/namespace/table names against path traversal Reject names containing '..', '/', or '\' in Execute to prevent directory traversal via crafted job parameters. * Add filer address failover in iceberg maintenance handler Try each filer address from cluster context in order instead of only using the first one. This improves resilience when the primary filer is temporarily unreachable. 
* Add separate MinManifestsToRewrite config for manifest rewrite threshold The rewrite_manifests operation was reusing MinInputFiles (meant for compaction bin file counts) as its manifest count threshold. Add a dedicated MinManifestsToRewrite field with its own config UI section and default value (5) so the two thresholds can be tuned independently. * Fix risky mtime fallback in orphan removal that could delete new files When entry.Attributes is nil, mtime defaulted to Unix epoch (1970), which would always be older than the safety threshold, causing the file to be treated as eligible for deletion. Skip entries with nil Attributes instead, matching the safer logic in operations.go. * Fix undefined function references in iceberg_maintenance_handler.go Use the exported function names (ShouldSkipDetectionByInterval, BuildDetectorActivity, BuildExecutorActivity) matching their definitions in vacuum_handler.go. * Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage The IcebergMaintenanceHandler and its compaction code in the parent pluginworker package duplicated the logic already present in the iceberg/ subpackage (which self-registers via init()). The old code lacked stale-plan guards, proper path normalization, CAS-based xattr updates, and error-returning parseOperations. Since the registry pattern (default "all") makes the old handler unreachable, remove it entirely. All functionality is provided by iceberg.Handler with the reviewed improvements. * Fix MinManifestsToRewrite clamping to match UI minimum of 2 The clamp reset values below 2 to the default of 5, contradicting the UI's advertised MinValue of 2. Clamp to 2 instead. * Sort entries by size descending in splitOversizedBin for better packing Entries were processed in insertion order which is non-deterministic from map iteration. Sorting largest-first before the splitting loop improves bin packing efficiency by filling bins more evenly. 
* Add context cancellation check to drainReader loop The row-streaming loop in drainReader did not check ctx between iterations, making long compaction merges uncancellable. Check ctx.Done() at the top of each iteration. * Fix splitOversizedBin to always respect targetSize limit The minFiles check in the split condition allowed bins to grow past targetSize when they had fewer than minFiles entries, defeating the OOM protection. Now bins always split at targetSize, and a trailing runt with fewer than minFiles entries is merged into the previous bin. * Add integration tests for iceberg table maintenance plugin worker Tests start a real weed mini cluster, create S3 buckets and Iceberg table metadata via filer gRPC, then exercise the iceberg.Handler operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against the live filer. A full maintenance cycle test runs all operations in sequence and verifies metadata consistency. Also adds exported method wrappers (testing_api.go) so the integration test package can call the unexported handler methods. * Fix splitOversizedBin dropping files and add source path to drainReader errors The runt-merge step could leave leading bins with fewer than minFiles entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop the first 80-byte file). Replace the filter-based approach with an iterative merge that folds any sub-minFiles bin into its smallest neighbor, preserving all eligible files. Also add the source file path to drainReader error messages so callers can identify which Parquet file caused a read/write failure. 
* Harden integration test error handling - s3put: fail immediately on HTTP 4xx/5xx instead of logging and continuing - lookupEntry: distinguish NotFound (return nil) from unexpected RPC errors (fail the test) - writeOrphan and orphan creation in FullMaintenanceCycle: check CreateEntryResponse.Error in addition to the RPC error * go fmt --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 11:27:42 -07:00
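The path-traversal check on bucket, namespace, and table names mentioned in the commit above can be sketched as a small validator. The function name and the empty-name rejection are assumptions added for illustration; the commit only states that names containing '..', '/', or '\' are rejected.

```go
package main

import (
	"fmt"
	"strings"
)

// validateName rejects names that could escape their parent directory.
// kind is only used to make the error message readable.
func validateName(kind, name string) error {
	if name == "" {
		// Assumption: an empty name is treated as invalid as well.
		return fmt.Errorf("%s name is empty", kind)
	}
	if strings.Contains(name, "..") || strings.ContainsAny(name, `/\`) {
		return fmt.Errorf("%s name %q contains a path separator or '..'", kind, name)
	}
	return nil
}

func main() {
	for _, name := range []string{"events", "../secrets", `a\b`, "ns/table"} {
		fmt.Printf("%q -> %v\n", name, validateName("table", name))
	}
}
```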
Chris Lu	c4d642b8aa	fix(ec): gather shards from all disk locations before rebuild (#8633 ) * fix(ec): gather shards from all disk locations before rebuild (#8631) Fix "too few shards given" error during ec.rebuild on multi-disk volume servers. The root cause has two parts: 1. VolumeEcShardsRebuild only looked at a single disk location for shard files. On multi-disk servers, the existing local shards could be on one disk while copied shards were placed on another, causing the rebuild to see fewer shards than actually available. 2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 && len(vs.store.Locations) > 0) that was always true, making the FindFreeLocation fallback dead code. This meant copies always went to Locations[0] regardless of where existing shards were. Changes: - VolumeEcShardsRebuild now finds the location with the most shards, then gathers shard files from other locations via hard links (or symlinks for cross-device) before rebuilding. Gathered files are cleaned up after rebuild. - VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0 (explicitly set). Otherwise, it prefers the location that already has the EC volume, falling back to HDD then any free location. - generateMissingEcFiles now logs shard counts and provides a clear error message when not enough shards are found, instead of passing through to the opaque reedsolomon "too few shards given" error. * fix(ec): update test to match skip behavior for unrepairable volumes The test expected an error for volumes with insufficient shards, but commit `5acb4578a` changed unrepairable volumes to be skipped with a log message instead of returning an error. Update the test to verify the skip behavior and log output. * fix(ec): address PR review comments - Add comment clarifying DiskId=0 means "not specified" (protobuf default), callers must use DiskId >= 1 to target a specific disk. - Log warnings on cleanup failures for gathered shard links. 
* fix(ec): read shard files from other disks directly instead of linking Replace the hard link / symlink gathering approach with passing additional search directories into RebuildEcFiles. The rebuild function now opens shard files directly from whichever disk they live on, avoiding filesystem link operations and cleanup. RebuildEcFiles and RebuildEcFilesWithContext gain a variadic additionalDirs parameter (backward compatible with existing callers). * fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment * fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations - generateMissingEcFiles: two-pass approach — first discover present/missing shards and check reconstructability, only then create output files. This avoids leaving behind empty truncated shard files when there are too few shards to rebuild. - VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations. A location with an .ecx file but no shard files (all shards on other disks) is now a valid rebuild candidate instead of being silently skipped. * fix(ec): select ecx-only location as rebuildLocation when none chosen yet When rebuildLocation is nil and a location has hasEcx=true but existingShardCount=0 (all shards on other disks), the condition 0 > 0 was false so it was never promoted to rebuildLocation. Add rebuildLocation == nil to the predicate so the first location with an .ecx file is always selected as a candidate.	2026-03-14 20:59:47 -07:00
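The rebuild-location selection described in the commit above can be sketched as follows. The `diskLocation` type and both helpers are hypothetical, and the sketch assumes the chosen location must hold the .ecx index file, which the commit message implies but does not state outright. It shows why the `best == nil` clause matters: without it, an ecx-only location with zero shards would never satisfy `0 > 0`.

```go
package main

import "fmt"

// diskLocation is a hypothetical summary of one disk on a volume server.
type diskLocation struct {
	dir        string
	shardCount int  // EC shard files found on this disk
	hasEcx     bool // whether the .ecx index file lives here
}

// pickRebuildLocation prefers the ecx-holding location with the most shards.
// The best == nil clause lets an ecx-only location (zero shards) be chosen.
func pickRebuildLocation(locs []diskLocation) *diskLocation {
	var best *diskLocation
	for i := range locs {
		loc := &locs[i]
		if !loc.hasEcx {
			continue // assumption: a rebuild location needs the .ecx file
		}
		if best == nil || loc.shardCount > best.shardCount {
			best = loc
		}
	}
	return best
}

// otherDirs lists the remaining directories, which a rebuild could search
// for shard files directly instead of linking them into place.
func otherDirs(locs []diskLocation, chosen *diskLocation) []string {
	var dirs []string
	for i := range locs {
		if &locs[i] != chosen {
			dirs = append(dirs, locs[i].dir)
		}
	}
	return dirs
}

func main() {
	locs := []diskLocation{
		{dir: "/data1", shardCount: 0, hasEcx: true},
		{dir: "/data2", shardCount: 9, hasEcx: false},
	}
	chosen := pickRebuildLocation(locs)
	fmt.Println("rebuild in:", chosen.dir, "also search:", otherDirs(locs, chosen))
}
```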
Chris Lu	baae672b6f	feat: auto-disable master vacuum when plugin worker is active (#8624 ) * feat: auto-disable master vacuum when plugin vacuum worker is active When a vacuum-capable plugin worker connects to the admin server, the admin server calls DisableVacuum on the master to prevent the automatic scheduled vacuum from conflicting with the plugin worker's vacuum. When the worker disconnects, EnableVacuum is called to restore the default behavior. A safety net in the topology refresh loop re-enables vacuum if the admin server disconnects without cleanup. * rename isAdminServerConnected to isAdminServerConnectedFunc * add 5s timeout to DisableVacuum/EnableVacuum gRPC calls Prevents the monitor goroutine from blocking indefinitely if the master is unresponsive. * track plugin ownership of vacuum disable to avoid overriding operator - Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum is called while admin server is connected (i.e., by plugin monitor) - Safety net only re-enables vacuum when it was disabled by plugin, not when an operator intentionally disabled it via shell command - EnableVacuum clears the plugin flag * extract syncVacuumState for testability, add fake toggler tests Extract the single sync step into syncVacuumState() with a vacuumToggler interface. Add TestSyncVacuumState with a fake toggler that verifies disable/enable calls on state transitions. * use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin Both fields are written by gRPC handlers and read by the vacuum goroutine, causing a data race. Use atomic.Bool with Store/Load for thread-safe access. * use explicit by_plugin field instead of connection heuristic Add by_plugin bool to DisableVacuumRequest proto so the caller declares intent explicitly. The admin server monitor sets it to true; shell commands leave it false. This prevents an operator's intentional disable from being auto-reversed by the safety net. 
* use setter for admin server callback instead of function parameter Move isAdminServerConnected from StartRefreshWritableVolumes parameter to Topology.SetAdminServerConnectedFunc() setter. Keeps the function signature stable and decouples the topology layer from the admin server concept. * suppress repeated log messages on persistent sync failures Add retrying parameter to syncVacuumState so the initial state transition is logged at V(0) but subsequent retries of the same transition are silent until the call succeeds. * clear plugin ownership flag on manual DisableVacuum Prevents stale plugin flag from causing incorrect auto-enable when an operator manually disables vacuum after a plugin had previously disabled it. * add by_plugin to EnableVacuumRequest for symmetric ownership tracking Plugin-driven EnableVacuum now only re-enables if the plugin was the one that disabled it. If an operator manually disabled vacuum after the plugin, the plugin's EnableVacuum is a no-op. This prevents the plugin monitor from overriding operator intent on worker disconnect. * use cancellable context for monitorVacuumWorker goroutine Replace context.Background() with a cancellable context stored as bgCancel on AdminServer. Shutdown() calls bgCancel() so monitorVacuumWorker exits cleanly via ctx.Done(). * track operator and plugin vacuum disables independently Replace single isDisableVacuum flag with two independent flags: vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller only flips its own flag. The effective disabled state is the OR of both. This prevents a plugin connect/disconnect cycle from overriding an operator's manual disable, and vice versa. * fix safety net to clear plugin flag, not operator flag The safety net should call EnableVacuumByPlugin() to clear only the plugin disable flag when the admin server disconnects. The previous call to EnableVacuum() incorrectly cleared the operator flag instead.	2026-03-13 22:49:12 -07:00
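The final ownership model in this commit (two independent flags, effective state is their OR) can be illustrated with a small sketch. The type and method names below are hypothetical and are not taken from the Topology code; only the use of `atomic.Bool` and the OR semantics come from the commit message.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// vacuumState is a hypothetical stand-in for the two flags described above.
// Each caller flips only its own flag, so neither can override the other.
type vacuumState struct {
	disabledByOperator atomic.Bool
	disabledByPlugin   atomic.Bool
}

// Disable sets the flag owned by the caller.
func (v *vacuumState) Disable(byPlugin bool) {
	if byPlugin {
		v.disabledByPlugin.Store(true)
		return
	}
	v.disabledByOperator.Store(true)
}

// Enable clears only the flag owned by the caller.
func (v *vacuumState) Enable(byPlugin bool) {
	if byPlugin {
		v.disabledByPlugin.Store(false)
		return
	}
	v.disabledByOperator.Store(false)
}

// IsDisabled reports the effective state: the OR of both flags.
func (v *vacuumState) IsDisabled() bool {
	return v.disabledByOperator.Load() || v.disabledByPlugin.Load()
}

func main() {
	var v vacuumState
	v.Disable(false) // operator disables vacuum via shell
	v.Disable(true)  // plugin worker connects
	v.Enable(true)   // plugin worker disconnects
	fmt.Println("still disabled by operator:", v.IsDisabled())
}
```

With a single shared flag, the plugin's disconnect in the usage above would have re-enabled vacuum against the operator's intent; separate flags avoid that without any connection heuristic.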
Chris Lu	92a76fc1a2	fix(filer): limit concurrent proxy reads per volume server (#8608) * fix(filer): limit concurrent proxy reads per volume server Add a per-volume-server semaphore (default 16) to proxyToVolumeServer to prevent replication bursts from overwhelming individual volume servers with hundreds of concurrent connections, which causes them to drop connections with "unexpected EOF". Excess requests queue up and respect the client's context, returning 503 if the client disconnects while waiting. Also log io.CopyBuffer errors that were previously silently discarded. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(filer): use non-blocking release for proxy semaphore Prevents a goroutine from blocking forever if releaseProxySemaphore is ever called without a matching acquire. * test(filer): clean up proxySemaphores entries in all proxy tests --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-11 23:32:09 -07:00
Lars Lehtonen	f8b7357350	weed/server: fix dropped error (#8584) * weed/server: fix dropped error * Removed the redundant check. --------- Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-09 18:04:12 -07:00
Chris Lu	2ec0a67ee3	master: return 503/Unavailable during topology warmup after leader change (#8529 ) * master: return 503/Unavailable during topology warmup after leader change After a master restart or leader change, the topology is empty until volume servers reconnect and send heartbeats. During this warmup window (3 heartbeat intervals = 15 seconds), volume lookups that fail now return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of 404 Not Found, signaling clients to retry with other masters. * master: skip warmup 503 on fresh start and single-master setups - Check MaxVolumeId > 0 to distinguish restart from fresh start (MaxVolumeId is Raft-persisted, so 0 means no prior data) - Check peer count > 1 so single-master deployments aren't affected (no point suggesting "retry with other masters" if there are none) * master: address review feedback and block assigns during warmup - Protect LastLeaderChangeTime with dedicated mutex (fix data race) - Extract warmup multiplier as WarmupPulseMultiplier constant - Derive Retry-After header from pulse config instead of hardcoding - Only trigger warmup 503 for "not found" errors, not parse errors - Return nil response (not partial) on gRPC Unavailable - Add doc comments to IsWarmingUp, getter/setter, WarmupDuration - Block volume assign requests (HTTP and gRPC) during warmup, since the topology is incomplete and assignments would be unreliable - Skip warmup behavior for single-master setups (no peers to retry) * master: apply warmup to all setups, skip only on fresh start Single-master restarts still have an empty topology until heartbeats arrive, so warmup protection should apply there too. The only case to skip is a fresh cluster start (MaxVolumeId == 0), which already has no volumes to look up. 
- Remove GetMasterCount() > 1 guard from all warmup checks - Remove now-unused GetMasterCount helper - Update error messages to "topology is still loading" (not "retry with other masters" which doesn't apply to single-master) * master: add client-side retry on Unavailable for lookup and assign The server-side 503/Unavailable during warmup needs client cooperation. Previously, LookupVolumeIds and Assign would immediately propagate the error without retry. Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to 6s) when receiving Unavailable, respecting context cancellation. This covers the warmup window where the master's topology is still loading after a restart or leader change. * master: seed warmup timestamp in legacy raft path at setup The legacy raft path only set lastLeaderChangeTime inside the event listener callback, which could fire after IsLeader() was already observed as true in SetRaftServer. Seed the timestamp at setup time (matching the hashicorp path) so IsWarmingUp() is active immediately. * master: fix assign retry loop to cover full warmup window The retry loop used waitTime <= maxWaitTime as a stop condition, causing it to give up after ~13s while warmup lasts 15s. Now cap each individual sleep at maxWaitTime but keep retrying until the context is cancelled. * master: preserve gRPC status in lookup retry and fix retry window Return the raw gRPC error instead of wrapping with fmt.Errorf so status.FromError() can extract the status code. Use proper gRPC status check (codes.Unavailable) instead of string matching. Also cap individual sleep at maxWaitTime while retrying until ctx is done. * master: use gRPC status code instead of string matching in assign retry Use status.FromError/codes.Unavailable instead of brittle strings.Contains for detecting retriable gRPC errors in the assign retry loop. 
* master: use remaining warmup duration for Retry-After header Set Retry-After to the remaining warmup time instead of the full warmup duration, so clients don't wait longer than necessary. * master: reset ret.Replicas before populating from assign response Clear Replicas slice before appending to prevent duplicate entries when the assign response is retried or when alternative requests are attempted. * master: add unit tests for warmup retry behavior Test that Assign() and LookupVolumeIds() retry on codes.Unavailable and stop promptly when the context is cancelled. * master: record leader change time before initialization work Move SetLastLeaderChangeTime() to fire immediately when the leader change event is received, before DoBarrier(), EnsureTopologyId(), and updatePeers(), so the warmup clock starts at the true moment of leadership transition. * master: use topology warmup duration in volume growth wait loop Replace hardcoded constants.VolumePulsePeriod * 2 with topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait stays in sync with the configured warmup window. Remove unused constants import. * master: resolve master before creating RPC timeout context Move GetMaster() call before context.WithTimeout() so master resolution blocking doesn't consume the gRPC call timeout. * master: use NotFound flag instead of string matching for volume lookup Add a NotFound field to LookupResult and set it in findVolumeLocation when a volume is genuinely missing. Update HTTP and gRPC warmup checks to use this flag instead of strings.Contains on the error message. * master: bound assign retry loop to 30s for deadline-free contexts Without a context deadline, the Unavailable retry loop could spin forever. Add a maxRetryDuration of 30s so the loop gives up even when no context deadline is set. 
* master: strengthen assign retry cancellation test Verify the retry loop actually retried (callCount > 1) and that the returned error is context.DeadlineExceeded, not just any error. * master: extract shared retry-with-backoff utility Add util.RetryWithBackoff for context-aware, bounded retry with exponential backoff. Refactor both Assign() and LookupVolumeIds() to use it instead of duplicating the retry/sleep/backoff logic. * master: cap waitTime in RetryWithBackoff to prevent unbounded growth Cap the backoff waitTime at maxWaitTime so it doesn't grow indefinitely in long-running retry scenarios. * master: only return Unavailable during warmup when all lookups failed For batched LookupVolume requests, return partial results when some volumes are found. Only return codes.Unavailable when no volumes were successfully resolved, so clients benefit from partial results instead of retrying unnecessarily. * master: set retriable error message in 503 response body When returning 503 during warmup, replace the "not found" error in the JSON body with "service warming up, please retry" so clients don't treat it as a permanent error. * master: guard empty master address in LookupVolumeIds If GetMaster() returns empty (no master found or ctx cancelled), return an appropriate error instead of dialing an empty address. Returns ctx.Err() if context is done, otherwise codes.Unavailable to trigger retry. * master: add comprehensive tests for RetryWithBackoff Test success after retries, non-retryable error handling, context cancellation, and maxDuration cap with context.Background(). * master: enforce hard maxDuration bound in RetryWithBackoff Use a deadline instead of elapsed-time check so the last sleep is capped to remaining time. This prevents the total retry duration from overshooting maxDuration by up to one full backoff interval. 
* master: respect fresh-start bypass in RemainingWarmupDuration Check IsWarmingUp() first (which returns false when MaxVolumeId==0) so RemainingWarmupDuration returns 0 on fresh clusters. * master: round up Retry-After seconds to avoid underestimating Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round up to the next integer (2) instead of flooring down (1). * master: tighten batch lookup warmup to all-NotFound only Only return codes.Unavailable when every requested volume ID was a transient not-found. Mixed cases with non-NotFound errors now return the response with per-volume error details preserved. * master: reduce retry log noise and fix timer leak Lower per-attempt retry log from V(0) to V(1) to reduce noise during warmup. Replace time.After with time.NewTimer to avoid lingering timers when context is cancelled. * master: add per-attempt timeout for assign RPC Use a 10s per-attempt timeout so a single slow RPC can't consume the entire 30s retry budget when ctx has no deadline. * master: share single 30s retry deadline across assign request entries The Assign() function iterates over primary and fallback requests, previously giving each its own 30s RetryWithBackoff budget. With a primary + fallback, the total could reach 60s. Compute one deadline up front and pass the remaining budget to each RetryWithBackoff call so the entire Assign() call stays within a single 30s cap. * master: strengthen context-cancel test with DeadlineExceeded and retry assertions Assert errors.Is(err, context.DeadlineExceeded) to verify the error is specifically from the context deadline, and check callCount > 1 to prove retries actually occurred before cancellation. Mirrors the pattern used in TestAssignStopsOnContextCancel. * master: bound GetMaster with per-attempt timeout in LookupVolumeIds GetMaster() calls WaitUntilConnected() which can block indefinitely if no master is available. 
Previously it used the outer ctx, so a slow master resolution could consume the entire RetryWithBackoff budget in a single attempt. Move the per-attempt timeoutCtx creation before the GetMaster call so both master resolution and the gRPC LookupVolume RPC share one grpcTimeout-bounded attempt. * master: use deadline-aware context for assign retry budget The shared 30s deadline only limited RetryWithBackoff's internal wall-clock tracking, but per-attempt contexts were still derived from the original ctx and could run for up to 10s even when the budget was nearly exhausted. Create a deadlineCtx from the computed deadline and derive both RetryWithBackoff and per-attempt timeouts from it so all operations honor the shared 30s cap. * master: skip warmup gate for empty lookup requests When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds) is 0 == 0 which is true, causing empty lookup batches during warmup to return codes.Unavailable and be retried endlessly. Add a len(req.VolumeOrFileIds) > 0 guard so empty requests pass through. * master: validate request fields before warmup gate in Assign Move Replication and Ttl parsing before the IsWarmingUp() check so invalid inputs get a proper validation error instead of being masked by codes.Unavailable during warmup. Pure syntactic validation does not depend on topology state and should run first. * master: check deadline and context before starting retry attempt RetryWithBackoff only checked the deadline and context after an attempt completed or during the sleep select. If the deadline expired or context was canceled during sleep, the next iteration would still call operation() before detecting it. Add pre-operation checks so no new attempt starts after the budget is exhausted. * master: always return ctx.Err() on context cancellation in RetryWithBackoff When ctx.Err() is non-nil, the pre-operation check was returning lastErr instead of ctx.Err(). 
This broke callers checking errors.Is(err, context.DeadlineExceeded) and contradicted the documented contract. Always return ctx.Err() so the cancellation reason is properly surfaced. * master: handle warmup errors in StreamAssign without killing the stream StreamAssign was returning codes.Unavailable errors from Assign directly, which terminates the gRPC stream and breaks pooled connections. Instead, return transient errors as in-band error responses so the stream survives warmup periods. Also reset assignClient in doAssign on Send/Recv failures so a broken stream doesn't leave the proxy permanently dead. * master: wait for warmup before slot search in findAndGrow findEmptySlotsForOneVolume was called before the warmup wait loop, selecting slots from an incomplete topology. Move the warmup wait before slot search so volume placement uses the fully warmed-up topology with all servers registered. * master: add Retry-After header to /dir/assign warmup response The /dir/lookup handler already sets Retry-After during warmup but /dir/assign did not, leaving HTTP clients without guidance on when to retry. Add the same header using RemainingWarmupDuration(). * master: only seed warmup timestamp on leader at startup SetLastLeaderChangeTime was called unconditionally for both leader and follower nodes. Followers don't need warmup state, and the leader change event listener handles real elections. Move the seed into the IsLeader() block so only the startup leader gets warmup initialized. * master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign StreamAssign returns transient warmup errors as in-band AssignResponse.Error messages. doAssign was converting these to plain fmt.Errorf, losing the codes.Unavailable classification needed for the caller's retry logic. Detect warmup error messages and wrap them as status.Error(codes.Unavailable) so RetryWithBackoff can retry.	2026-03-08 16:05:45 -07:00
Chris Lu	3f946fc0c0	mount: make metadata cache rebuilds snapshot-consistent (#8531) * filer: expose metadata events and list snapshots * mount: invalidate hot directory caches * mount: read hot directories directly from filer * mount: add sequenced metadata cache applier * mount: apply metadata responses through cache applier * mount: replay snapshot-consistent directory builds * mount: dedupe self metadata events * mount: factor directory build cleanup * mount: replace proto marshal dedup with composite key and ring buffer The dedup logic was doing a full deterministic proto.Marshal on every metadata event just to produce a dedup key. Replace with a cheap composite string key (TsNs|Directory|OldName|NewName). Also replace the sliding-window slice (which leaked the backing array unboundedly) with a fixed-size ring buffer that reuses the same array. * filer: remove mutex and proto.Clone from request-scoped MetadataEventSink MetadataEventSink is created per-request and only accessed by the goroutine handling the gRPC call. The mutex and double proto.Clone (once in Record, once in Last) were unnecessary overhead on every filer write operation. Store the pointer directly instead. * mount: skip proto.Clone for caller-owned metadata events Add ApplyMetadataResponseOwned that takes ownership of the response without cloning. Local metadata events (mkdir, create, flush, etc.) are freshly constructed and never shared, so the clone is unnecessary. * filer: only populate MetadataEvent on successful DeleteEntry Avoid calling eventSink.Last() on error paths where the sink may contain a partial event from an intermediate child deletion during recursive deletes. * mount: avoid map allocation in collectDirectoryNotifications Replace the map with a fixed-size array and linear dedup. There are at most 3 directories to notify (old parent, new parent, new child if directory), so a 3-element array avoids the heap allocation on every metadata event. 
* mount: fix potential deadlock in enqueueApplyRequest Release applyStateMu before the blocking channel send. Previously, if the channel was full (cap 128), the send would block while holding the mutex, preventing Shutdown from acquiring it to set applyClosed. * mount: restore signature-based self-event filtering as fast path Re-add the signature check that was removed when content-based dedup was introduced. Checking signatures is O(1) on a small slice and avoids enqueuing and processing events that originated from this mount instance. The content-based dedup remains as a fallback. * filer: send snapshotTsNs only in first ListEntries response The snapshot timestamp is identical for every entry in a single ListEntries stream. Sending it in every response message wastes wire bandwidth for large directories. The client already reads it only from the first response. * mount: exit read-through mode after successful full directory listing MarkDirectoryRefreshed was defined but never called, so directories that entered read-through mode (hot invalidation threshold) stayed there permanently, hitting the filer on every readdir even when cold. Call it after a complete read-through listing finishes. * mount: include event shape and full paths in dedup key The previous dedup key only used Names, which could collapse distinct rename targets. Include the event shape (C/D/U/R), source directory, new parent path, and both entry names so structurally different events are never treated as duplicates. * mount: drain pending requests on shutdown in runApplyLoop After receiving the shutdown sentinel, drain any remaining requests from applyCh non-blockingly and signal each with errMetaCacheClosed so callers waiting on req.done are released. * mount: include IsDirectory in synthetic delete events metadataDeleteEvent now accepts an isDirectory parameter so the applier can distinguish directory deletes from file deletes. Rmdir passes true, Unlink passes false. 
* mount: fall back to synthetic event when MetadataEvent is nil In mknod and mkdir, if the filer response omits MetadataEvent (e.g. older filer without the field), synthesize an equivalent local metadata event so the cache is always updated. * mount: make Flush metadata apply best-effort after successful commit After filer_pb.CreateEntryWithResponse succeeds, the entry is persisted. Don't fail the Flush syscall if the local metadata cache apply fails — log and invalidate the directory cache instead. Also fall back to a synthetic event when MetadataEvent is nil. * mount: make Rename metadata apply best-effort The rename has already succeeded on the filer by the time we apply the local metadata event. Log failures instead of returning errors that would be dropped by the caller anyway. * mount: make saveEntry metadata apply best-effort with fallback After UpdateEntryWithResponse succeeds, treat local metadata apply as non-fatal. Log and invalidate the directory cache on failure. Also fall back to a synthetic event when MetadataEvent is nil. * filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot Return the snapshot timestamp even when the first page fails, so callers receive the snapshot boundary when partial data was received. * filer: send snapshot token for empty directory listings When no entries are streamed, send a final ListEntriesResponse with only SnapshotTsNs so clients always receive the snapshot boundary. * mount: distinguish not-found vs transient errors in lookupEntry Return fuse.EIO for non-not-found filer errors instead of unconditionally returning ENOENT, so transient failures don't masquerade as missing entries. * mount: make CacheRemoteObject metadata apply best-effort The file content has already been cached successfully. Don't fail the read if the local metadata cache update fails. 
* mount: use consistent snapshot for readdir in direct mode Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect call and store it on the DirectoryHandle. Subsequent batch loads pass this stored timestamp so all batches use the same snapshot. Also export DoSeaweedListWithSnapshot so mount can use it directly with snapshot passthrough. * filer_pb: fix test fake to send SnapshotTsNs only on first response Match the server behavior: only the first ListEntriesResponse in a page carries the snapshot timestamp, subsequent entries leave it zero. * Fix nil pointer dereference in ListEntries stream consumers Remove the empty-directory snapshot-only response from ListEntries that sent a ListEntriesResponse with Entry==nil, which crashed every raw stream consumer that assumed resp.Entry is always non-nil. Also add defensive nil checks for resp.Entry in all raw ListEntries stream consumers across: S3 listing, broker topic lookup, broker topic config, admin dashboard, topic retention, hybrid message scanner, Kafka integration, and consumer offset storage. * Add nil guards for resp.Entry in remaining ListEntries stream consumers Covers: S3 object lock check, MQ management dashboard (version/ partition/offset loops), and topic retention version loop. * Make applyLocalMetadataEvent best-effort in Link and Symlink The filer operations already succeeded; failing the syscall because the local cache apply failed is wrong. Log a warning and invalidate the parent directory cache instead. * Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink The filer RPC already committed; don't fail the syscall when the local metadata cache apply fails. Log a warning and invalidate the parent directory cache to force a re-fetch on next access. * flushFileMetadata: add nil-fallback for metadata event and best-effort apply Synthesize a metadata event when resp.GetMetadataEvent() is nil (matching doFlush), and make the apply best-effort with cache invalidation on failure. 
* Prevent double-invocation of cleanupBuild in doEnsureVisited Add a cleanupDone guard so the deferred cleanup and inline error-path cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild. * Fix comment: signature check is O(n) not O(1) * Prevent deferred cleanup after successful CompleteDirectoryBuild Set cleanupDone before returning from the success path so the deferred context-cancellation check cannot undo a published build. * Invalidate parent directory caches on rename metadata apply failure When applyLocalMetadataEvent fails during rename, invalidate the source and destination parent directory caches so subsequent accesses trigger a re-fetch from the filer. * Add event nil-fallback and cache invalidation to Link and Symlink Synthesize metadata events when the server doesn't return one, and invalidate parent directory caches on apply failure. * Match requested partition when scanning partition directories Parse the partition range format (NNNN-NNNN) and match against the requested partition parameter instead of using the first directory. * Preserve snapshot timestamp across empty directory listings Initialize actualSnapshotTsNs from the caller-requested value so it isn't lost when the server returns no entries. Re-add the server-side snapshot-only response for empty directories (all raw stream consumers now have nil guards for Entry). * Fix CreateEntry error wrapping to support errors.Is/errors.As Use errors.New + %w instead of %v for resp.Error so callers can unwrap the underlying error. * Fix object lock pagination: only advance on non-nil entries Move entriesReceived inside the nil check so nil entries don't cause repeated ListEntries calls with the same lastFileName. 
* Guard Attributes nil check before accessing Mtime in MQ management * Do not send nil-Entry response for empty directory listings The snapshot-only ListEntriesResponse (with Entry == nil) for empty directories breaks consumers that treat any received response as an entry (Java FilerClient, S3 listing). The Go client-side DoSeaweedListWithSnapshot already preserves the caller-requested snapshot via actualSnapshotTsNs initialization, so the server-side send is unnecessary. * Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race - Remove self-signature early-return in processEventFn so all events flow through the applier (directory-build buffering sees self-originated events that arrive after a snapshot) - Normalize NewParentPath in collectEntryInvalidations to avoid duplicate invalidations when NewParentPath is empty (same-directory update) - Guard resp.Entry.Attributes for nil in admin_server.go and topic_retention.go to prevent panics on entries without attributes - Fix enqueueApplyRequest race with shutdown by using select on both applyCh and applyDone, preventing sends after the apply loop exits - Add cleanupDone check to deferred cleanup in meta_cache_init.go for clarity alongside the existing guard in cleanupBuild - Add empty directory test case for snapshot consistency * Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories - Add metadata_event field to CacheRemoteObjectToLocalClusterResponse proto so the filer-emitted event is available to callers - Use WithMetadataEventSink in the server handler to capture the event from NotifyUpdateEvent and return it on the response - Update filehandle_read.go to prefer the RPC's metadata event over a locally fabricated one, falling back to metadataUpdateEvent when the server doesn't provide one (e.g., older filers) - Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot when the server sends 
no snapshot (empty directory), so callers like CompleteDirectoryBuild get a meaningful boundary for filtering buffered events * Skip directory notifications for dirs being built to prevent mid-build cache wipe When a metadata event is buffered during a directory build, applyMetadataSideEffects was still firing noteDirectoryUpdate for the building directory. If the directory accumulated enough updates to become "hot", markDirectoryReadThrough would call DeleteFolderChildren, wiping entries that EnsureVisited had already inserted. The build would then complete and mark the directory cached with incomplete data. Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered events, which suppresses directory notifications for dirs currently in buildingDirs while still applying entry invalidations. * Add test for directory notification suppression during active build TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata events targeting a directory under active EnsureVisited build do NOT fire onDirectoryUpdate for that directory. In production, this prevents markDirectoryReadThrough from calling DeleteFolderChildren mid-build, which would wipe entries already inserted by the listing. The test inserts an entry during a build, sends multiple metadata events for the building directory, asserts no notifications fired for it, verifies the entry survives, and confirms buffered events are replayed after CompleteDirectoryBuild. 
* Fix create invalidations, build guard, event shape, context, and snapshot error path - collectEntryInvalidations: invalidate FUSE kernel cache on pure create events (OldEntry==nil && NewEntry!=nil), not just updates and deletes - completeDirectoryBuildNow: only call markCachedFn when an active build existed (state != nil), preventing an unpopulated directory from being marked as cached - Add metadataCreateEvent helper that produces a create-shaped event (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink, and hardlink create fallback paths instead of metadataUpdateEvent which incorrectly set both OldEntry and NewEntry - applyMetadataResponseEnqueue: use context.Background() for the queued mutation so a cancelled caller context cannot abort the apply loop mid-write - DoSeaweedListWithSnapshot: move snapshot initialization before ListEntries call so the error path returns the preserved snapshot instead of 0 * Fix review findings: test loop, cache race, context safety, snapshot consistency - Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification - Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough - Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way - Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency - Prefer resp.MetadataEvent over fallback in Unlink event derivation - Add comment on MetadataEventSink.Record single-event assumption * Fix empty-directory snapshot clock skew and build cancellation race Empty-directory snapshot: Remove client-side time.Now() synthesis when the server returns no entries. Instead return snapshotTsNs=0, and in completeDirectoryBuildNow replay ALL buffered events when snapshot is 0. This eliminates the clock-skew bug where a client ahead of the filer would filter out legitimate post-list events. 
Build cancellation: Use context.Background() for BeginDirectoryBuild and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup cancellation doesn't cause enqueueAndWait to return early and trigger cleanupBuild while the operation is still queued. * Add tests for empty-directory build replay and cancellation resilience TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no server snapshot), ALL buffered events are replayed regardless of their TsNs values — no clock-skew-sensitive filtering occurs. TestBuildCompletionSurvivesCallerCancellation: verifies that once CompleteDirectoryBuild is enqueued, a cancelled caller context does not prevent the build from completing. The apply loop runs with context.Background(), so the directory becomes cached and buffered events are replayed even when the caller gives up waiting. * Fix directory subtree cleanup, Link rollback, test robustness - applyMetadataResponseLocked: when a directory entry is deleted or moved, call DeleteFolderChildren on the old path so cached descendants don't leak as stale entries. - Link: save original HardLinkId/Counter before mutation. If CreateEntryWithResponse fails after the source was already updated, rollback the source entry to its original state via UpdateEntry. - TestBuildCompletionSurvivesCallerCancellation: replace fixed time.Sleep(50ms) with a deadline-based poll that checks IsDirectoryCached in a loop, failing only after 2s timeout. - TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that ListEntries was actually invoked on the mock client so the test exercises the RPC path. - newMetadataEvent: add early return when both oldEntry and newEntry are nil to avoid emitting events with empty Directory. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-07 09:19:40 -08:00
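The replay rule described in the commit above (filter buffered events against the listing snapshot, but replay all of them when the snapshot is 0) can be sketched as follows. This is a minimal illustration, not the code from the commit: `bufferedEvent` and `eventsToReplay` are hypothetical names, and the strict `>` comparison against the snapshot timestamp is an assumption.

```go
package main

import "fmt"

// bufferedEvent is a hypothetical stand-in for a metadata event that was
// held back while a directory listing was in progress.
type bufferedEvent struct {
	Name string
	TsNs int64
}

// eventsToReplay returns the buffered events that should be applied once the
// directory build completes. A snapshotTsNs of 0 means the server returned no
// snapshot (empty directory), so every buffered event is replayed and no
// client-side clock is consulted.
// Assumption: events strictly newer than the snapshot are the ones kept.
func eventsToReplay(buffered []bufferedEvent, snapshotTsNs int64) []bufferedEvent {
	replay := make([]bufferedEvent, 0, len(buffered))
	for _, ev := range buffered {
		if snapshotTsNs == 0 || ev.TsNs > snapshotTsNs {
			replay = append(replay, ev)
		}
	}
	return replay
}

func main() {
	buffered := []bufferedEvent{{"a.txt", 100}, {"b.txt", 200}, {"c.txt", 300}}
	// With a real snapshot, only events after it are replayed.
	fmt.Println(len(eventsToReplay(buffered, 200)))
	// With no snapshot, all buffered events are replayed.
	fmt.Println(len(eventsToReplay(buffered, 0)))
}
```

Treating 0 as "no snapshot" keeps the decision on server-provided timestamps only, which is what removes the clock-skew sensitivity the commit message describes.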
Chris Lu	af4c3fcb31	ec: fall back to data dir when ecx file not found in idx dir (#8541 ) * ec: fall back to data dir when ecx file not found in idx dir (#8540) When -dir.idx is configured after EC encoding, the .ecx/.ecj files remain in the data directory. NewEcVolume now falls back to the data directory when the index file is not found in dirIdx. * ec: add fallback logging and improved error message for ecx lookup * ec: preserve configured dirIdx, track actual ecx location separately The previous fallback set ev.dirIdx = dir when finding .ecx in the data directory, which corrupted IndexBaseFileName() for future writes (e.g., WriteIdxFileFromEcIndex during EC-to-volume conversion would write the .idx file to the data directory instead of the configured index directory). Introduce ecxActualDir to track where .ecx/.ecj were actually found, used only by FileName() for cleanup/destroy. IndexBaseFileName() continues to use the configured dirIdx for new file creation. * ec: check both idx and data dirs for .ecx in all cleanup and lookup paths When -dir.idx is configured after EC encoding, .ecx/.ecj files may reside in the data directory. Several code paths only checked l.IdxDirectory, causing them to miss these files: - removeEcVolumeFiles: now removes .ecx/.ecj from both directories - loadExistingVolume: ecx existence check falls back to data dir - deleteEcShardIdsForEachLocation: ecx existence check and cleanup both cover the data directory - VolumeEcShardsRebuild: ecx lookup falls back to data directory so RebuildEcxFile operates on the correct file	2026-03-07 09:18:48 -08:00
Chris Lu	540fc97e00	s3/iam: reuse one request id per request (#8538 ) * request_id: add shared request middleware * s3err: preserve request ids in responses and logs * iam: reuse request ids in XML responses * sts: reuse request ids in XML responses * request_id: drop legacy header fallback * request_id: use AWS-style request id format * iam: fix AWS-compatible XML format for ErrorResponse and field ordering - ErrorResponse uses bare <RequestId> at root level instead of <ResponseMetadata> wrapper, matching the AWS IAM error response spec - Move CommonResponse to last field in success response structs so <ResponseMetadata> serializes after result elements - Add randomness to request ID generation to avoid collisions - Add tests for XML ordering and ErrorResponse format * iam: remove duplicate error_response_test.go Test is already covered by responses_test.go. * address PR review comments - Guard against typed nil pointers in SetResponseRequestID before interface assertion (CodeRabbit) - Use regexp instead of strings.Index in test helpers for extracting request IDs (Gemini) * request_id: prevent spoofing, fix nil-error branch, thread reqID to error writers - Ensure() now always generates a server-side ID, ignoring client-sent x-amz-request-id headers to prevent request ID spoofing. Uses a private context key (contextKey{}) instead of the header string. - writeIamErrorResponse in both iamapi and embedded IAM now accepts reqID as a parameter instead of calling Ensure() internally, ensuring a single request ID per request lifecycle. - The nil-iamError branch in writeIamErrorResponse now writes a 500 Internal Server Error response instead of returning silently. - Updated tests to set request IDs via context (not headers) and added tests for spoofing prevention and context reuse. 
* sts: add request-id consistency assertions to ActionInBody tests * test: update admin test to expect server-generated request IDs The test previously sent a client x-amz-request-id header and expected it echoed back. Since Ensure() now ignores client headers to prevent spoofing, update the test to verify the server returns a non-empty server-generated request ID instead. * iam: add generic WithRequestID helper alongside reflection-based fallback Add WithRequestID[T] that uses generics to take the address of a value type, satisfying the pointer receiver on SetRequestId without reflection. The existing SetResponseRequestID is kept for the two call sites that operate on interface{} (from large action switches where the concrete type varies at runtime). Generics cannot replace reflection there since Go cannot infer type parameters from interface{}. * Remove reflection and generics from request ID setting Call SetRequestId directly on concrete response types in each switch branch before boxing into interface{}, eliminating the need for WithRequestID (generics) and SetResponseRequestID (reflection). * iam: return pointer responses in action dispatch * Fix IAM error handling consistency and ensure request IDs on all responses - UpdateUser/CreatePolicy error branches: use writeIamErrorResponse instead of s3err.WriteErrorResponse to preserve IAM formatting and request ID - ExecuteAction: accept reqID parameter and generate one if empty, ensuring every response carries a RequestId regardless of caller * Clean up inline policies on DeleteUser and UpdateUser rename DeleteUser: remove InlinePolicies[userName] from policy storage before removing the identity, so policies are not orphaned. UpdateUser: move InlinePolicies[userName] to InlinePolicies[newUserName] when renaming, so GetUserPolicy/DeleteUserPolicy work under the new name. Both operations persist the updated policies and return an error if the storage write fails, preventing partial state.	2026-03-06 15:22:39 -08:00
Chris Lu	338be16254	fix logs	2026-03-05 15:38:05 -08:00
Chris Lu	230ae9c24e	no need to set default scripts now	2026-03-04 22:27:02 -08:00
Chris Lu	b3620c7e14	admin: auto migrating master maintenance scripts to admin_script plugin config (#8509 ) * admin: seed admin_script plugin config from master maintenance scripts When the admin server starts, fetch the maintenance scripts configuration from the master via GetMasterConfiguration. If the admin_script plugin worker does not already have a saved config, use the master's scripts as the default value. This enables seamless migration from master.toml [master.maintenance] to the admin script plugin worker. Changes: - Add maintenance_scripts and maintenance_sleep_minutes fields to GetMasterConfigurationResponse in master.proto - Populate the new fields from viper config in master_grpc_server.go - On admin server startup, fetch the master config and seed the admin_script plugin config if no config exists yet - Strip lock/unlock commands from the master scripts since the admin script worker handles locking automatically Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address review comments on admin_script seeding - Replace TOCTOU race (separate Load+Save) with atomic SaveJobTypeConfigIfNotExists on ConfigStore and Plugin - Replace ineffective polling loop with single GetMaster call using 30s context timeout, since GetMaster respects context cancellation - Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: apply maintenance script defaults in gRPC handler The gRPC handler for GetMasterConfiguration read maintenance scripts from viper without calling SetDefault, relying on startAdminScripts having run first. If the admin server calls GetMasterConfiguration before startAdminScripts sets the defaults, viper returns empty strings and the seeding is silently skipped. Apply SetDefault in the gRPC handler itself so it is self-contained. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Revert "fix: apply maintenance script defaults in gRPC handler" This reverts commit `068a506330`. * fix: use atomic save in ensureJobTypeConfigFromDescriptor ensureJobTypeConfigFromDescriptor used a separate Load + Save, racing with seedAdminScriptFromMaster. If the descriptor defaults (empty script) were saved first, SaveJobTypeConfigIfNotExists in the seeding goroutine would see an existing config and skip, losing the master's maintenance scripts. Switch to SaveJobTypeConfigIfNotExists so both paths are atomic. Whichever wins, the other is a safe no-op. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: fetch master scripts inline during config bootstrap, not in goroutine Replace the seedAdminScriptFromMaster goroutine with a ConfigDefaultsProvider callback. When the plugin bootstraps admin_script defaults from the worker descriptor, it calls the provider which fetches maintenance scripts from the master synchronously. This eliminates the race between the seeding goroutine and the descriptor-based config bootstrap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * skip commented lock unlock Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com> * reduce grpc calls --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-04 22:11:07 -08:00
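The check-and-save race that the commit above works around comes down to doing the existence check and the write under one lock. A minimal in-memory sketch follows; `configStore` and `saveIfNotExists` are hypothetical stand-ins for illustration and do not reproduce the real `SaveJobTypeConfigIfNotExists` signature or its on-disk behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// configStore is a hypothetical in-memory store keyed by job type.
type configStore struct {
	mu      sync.Mutex
	configs map[string]string
}

func newConfigStore() *configStore {
	return &configStore{configs: make(map[string]string)}
}

// saveIfNotExists stores config only when no config exists for jobType.
// The check and the write happen under the same lock, so two concurrent
// callers cannot both observe "missing" and overwrite each other.
// It reports whether this call performed the save.
func (s *configStore) saveIfNotExists(jobType, config string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.configs[jobType]; exists {
		return false
	}
	s.configs[jobType] = config
	return true
}

func main() {
	store := newConfigStore()
	candidates := []string{"descriptor defaults", "master maintenance scripts"}
	saved := make([]bool, len(candidates))

	var wg sync.WaitGroup
	for i, cfg := range candidates {
		wg.Add(1)
		go func(i int, cfg string) {
			defer wg.Done()
			saved[i] = store.saveIfNotExists("admin_script", cfg)
		}(i, cfg)
	}
	wg.Wait()

	// Exactly one writer wins; the other call is a no-op.
	fmt.Println(saved)
}
```

Atomicity alone does not decide which writer wins, which is consistent with the later commit in this series that fetches the master scripts inline during bootstrap instead of racing a goroutine.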

1849 Commits