seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-26 10:03:13 +00:00

Author	SHA1	Message	Date
Chris LuandGitHub	d2e64e85ce	fix(log_buffer): back off disk-poll cadence when caught up to disk head (#9161 ) Idle subscribers parked in the ResumeFromDiskError path were re-probing the in-memory buffer and disk every 250ms, emitting a "Notification timeout after ResumeFromDiskError, rechecking state" log line per tick even when nothing had changed. Once ReadFromDiskFn returns without advancing lastReadPosition, we know the disk has no data past the subscriber's current position. Switch to a 2s poll in that state so external disk writers (e.g. Schema Registry) are still re-detected on a bounded cadence, but idle CPU and log noise drop ~8x. Any progress (disk advance or notifyChan wakeup) clears the flag and restores the responsive 250ms tick for active readers. The per-timeout V(4) log is replaced by a single "Caught up to disk head" transition log.	2026-04-20 14:36:56 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	61c1735cdd	build(deps): bump modernc.org/sqlite from 1.46.1 to 1.49.1 (#9155 ) Bumps [modernc.org/sqlite](https://gitlab.com/cznic/sqlite) from 1.46.1 to 1.49.1. - [Changelog](https://gitlab.com/cznic/sqlite/blob/master/CHANGELOG.md) - [Commits](https://gitlab.com/cznic/sqlite/compare/v1.46.1...v1.49.1) --- updated-dependencies: - dependency-name: modernc.org/sqlite dependency-version: 1.49.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:20:55 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	49ce2b7326	build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 (#9156 ) Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5. - [Release notes](https://github.com/rclone/rclone/releases) - [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md) - [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5) --- updated-dependencies: - dependency-name: github.com/rclone/rclone dependency-version: 1.73.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:20:46 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	a890300eaf	build(deps): bump cloud.google.com/go/pubsub from 1.50.1 to 1.50.2 (#9154 ) Bumps [cloud.google.com/go/pubsub](https://github.com/googleapis/google-cloud-go) from 1.50.1 to 1.50.2. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/pubsub/v1.50.1...pubsub/v1.50.2) --- updated-dependencies: - dependency-name: cloud.google.com/go/pubsub dependency-version: 1.50.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:12:47 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	6642a64d2b	build(deps): bump github.com/go-git/go-billy/v5 from 5.6.2 to 5.8.0 (#9152 ) Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.6.2 to 5.8.0. - [Release notes](https://github.com/go-git/go-billy/releases) - [Commits](https://github.com/go-git/go-billy/compare/v5.6.2...v5.8.0) --- updated-dependencies: - dependency-name: github.com/go-git/go-billy/v5 dependency-version: 5.8.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:06:33 -07:00
Lars LehtonenandGitHub	93d604d799	chore(weed/s3api/policy): prune unused test functions (#9150 )	2026-04-20 12:04:41 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	dec09d1484	build(deps): bump github.com/aws/aws-sdk-go-v2 from 1.41.5 to 1.41.6 (#9153 ) Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.41.5 to 1.41.6. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/v1.41.5...v1.41.6) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2 dependency-version: 1.41.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:03:28 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	25d7f2c569	build(deps): bump docker/build-push-action from 6 to 7 (#9151 ) Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](https://github.com/docker/build-push-action/compare/v6...v7) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-20 12:02:50 -07:00
Chris LuandGitHub	86c5e815d2	fix(kafka): make consumer-group rebalancing work end-to-end (#9143 ) * fix(kafka): make consumer-group rebalancing work end-to-end TestConsumerGroups was failing every run since the job was added (2026-04-17) but the failures were masked by a `\|\| echo ...` trailer on the go test invocation, so the CI reported green. Removing the mask exposes several real bugs in the gateway's group-coordinator code: 1. JoinGroup deduplicated members by ClientID, which collapsed two Sarama consumers that share the default ClientID ("sarama") into a single member slot and broke rebalancing. Key dedup off the TCP ConnectionID instead; keep ClientID on the member for DescribeGroup fidelity. 2. Every JoinGroup replaced the GroupMember struct, wiping the Assignment the leader had just published in its SyncGroup and leaving non-leader consumers with 0 partitions after a rebalance. Update the existing member in place on rejoin. 3. Non-leader SyncGroup returned an empty assignment while the leader was mid-rebalance, so consumers silently came up with no partitions. Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama retries the join/sync cycle (4 retries x 2s backoff by default). 4. Heartbeat returned ILLEGAL_GENERATION on a gen mismatch even when the group was in PreparingRebalance/CompletingRebalance. Return REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly cancels the session instead of tearing it down on a fatal error. 5. LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3 (Members array) by default, so the gateway silently rejected the request as InvalidGroupID and dead consumers stayed in the group as phantom leaders. Added v3 (Members array) and v4+ (flexible/compact/ tagged-fields) parsing. The rebalancing integration tests called Consume() once per consumer, which cannot survive a rebalance (heartbeat RBIP cancels the session and Consume() returns - this is documented Sarama behaviour; callers are expected to loop). Added a runConsumeLoop helper and used it in the four affected sub-tests. RebalanceTestHandler.Setup now overwrites stale entries in its assignments channel so the test observes the settled post-rebalance snapshot rather than whatever arrived first. fix(kafka): address PR review feedback - JoinGroup now snapshots existing members before mutating and restores the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the rollback path always deleted the entry, corrupting group state when an existing member rejoined with an incompatible protocol. - handleLeaveGroup iterates request.Members instead of processing only the first entry, so v3+ batch departures (KIP-345 style) correctly remove every listed member and build a per-member response. A single group-state transition runs after the loop, with leader election only triggered if the actual group leader was among the departures. - Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser already decoded flexible versions, but the response still went out in non-flexible encoding (4-byte array lengths, 2-byte strings, no tagged fields), which v4+ clients could not parse. Route flexible versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse. - BasicFunctionality gives each consumer its own ConsumerGroupHandler/ready channel. The previous shared handler closed ready once, so readyCount advanced to numConsumers from a single signal; the test could proceed without the other consumers actually reaching Setup. - RebalanceTestHandler.assignments is now a size-1 channel, so readers always observe the most recent rebalance snapshot instead of an intermediate one from an earlier round.	2026-04-20 10:11:45 -07:00
Ping Qiu GitHub Claude Opus 4.7	f4ce2be875	doc: P14 S8 final bounded close — evidence matrix + P15 handoff (#9142 ) * doc: P14 S8 final bounded close — evidence matrix + P15 handoff Adds the six S8 closure deliverables consolidating S4-S7 evidence, classifying V2 scenarios, and mapping residual product gaps onto canonical P15 tracks (per v3-phase-15-product-plan.md §4). New docs: - v3-phase-14-s8-assignment.md — S8 execution contract. - v3-phase-14-s8-final-bounded-close.md — bounded P14 target, accepted topology, reject conditions. - v3-phase-14-s8-evidence-matrix.md — 16 claims × {L0, L1, L2, L3, Status, Residual}. 15 PROVEN, 1 PARTIAL (Claim 15 fence quantitative bound, P14 internal follow-up). Rounds 2-3 architect corrections: Claim 10 / 12 L2 narrowed; Claim 6 refresh gap closed by the new L1 test (see companion commit in seaweed_block). - v3-phase-14-s8-v2-scenario-classification.md — every V2 scenario mapped to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS / BLOCKED-HA / BLOCKED-PERF / PORT-MECHANISM; scenario YAMLs kept as L3 shape, not executed evidence. - v3-phase-14-s8-p15-handoff.md — 11 rows (10 canonical P15 tracks + 1 P14 internal follow-up anchored to Claim 15 PARTIAL); §4 integrity check split by row class. - v3-phase-14-s8-closure.md — final P14 closure statement matching the close doc §10 wording; explicit non-goals; all 9 P15 tracks named with canonical numbering. No claim of CSI / frontend / migration / security / performance / production readiness. Every product gap is handed off with a concrete first-proof gate. Companion: seaweed_block commit adds the IntentRefreshEndpoint L1 route test that closes Claim 6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * doc: P14 S8 — resolve port-now doc conflict (CodeRabbit #4) final-bounded-close.md §7 previously said "Port now: testrunner, scenarios, component harness, qa_block, learn/test" while v2-scenario-classification.md §2 says S8 does NOT port testrunner machinery and defers all actual porting to P15. Align final-bounded-close.md §7 with classification: section renamed "Classify now (S8 scope), port deferred to P15". Every item now states which P15 track actually owns the port (Final Gate or T1 Frontend + Data Path as applicable). No scope expansion; no new handoff gap. Pure doc-consistency fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 02:24:44 -07:00
Chris LuandGitHub	f1d5f31a93	fix(mount): retry saveEntry on transient filer errors; stop mismapping Canceled to EIO (#9141 ) * fix(mount): retry saveEntry on transient filer errors, stop mismapping Canceled to EIO When the mount's gRPC connection to the filer flaps (e.g. a transient restart or network blip), every in-flight setattr/utimes/chmod/xattr/ rename-driven saveEntry returns "code = Canceled desc = grpc: the client connection is closing" at the same instant. Two bugs in saveEntry then turned each of those into a hard EIO for the user: 1. The error was wrapped with fmt.Errorf(... %v ...) before being passed to grpcErrorToFuseStatus. %v stringifies the status, so status.FromError could no longer unwrap the gRPC code and the Canceled→ETIMEDOUT branch in the classifier never fired; every Canceled error fell through to the default EIO. 2. saveEntry issued a single streamUpdateEntry call with no retry, unlike doFlush which already wraps its CreateEntry in retryMetadataFlush. One stream flap therefore propagated straight to the FUSE caller instead of being ridden out across the 4-attempt / ~7s backoff window. Wrap the UpdateEntry call in retryMetadataFlush (matching doFlush and completeAsyncFlush) and switch the wrap verb to %w so the classifier can still see the gRPC code. This recovers transient closes silently and, if retries are exhausted, returns ETIMEDOUT instead of EIO. Reported by rclone users in #9139 where a large concurrent copy (hundreds of .partial uploads per filer flap) surfaced as walls of EIOs because each .partial rename's post-setattr hit saveEntry at the worst possible moment. * mount: skip saveEntry retries on permanent filer errors Address gemini-code-assist review on #9141: blindly retrying every UpdateEntry failure with exponential backoff means interactive FUSE ops like chmod/utimes/xattr can hang for ~7s before surfacing clearly permanent errors (NotFound, PermissionDenied, InvalidArgument, etc.). Introduce retryMetadataFlushIf, a variant of retryMetadataFlush that accepts a shouldRetry predicate, and an isRetryableFilerError classifier that short-circuits on a conservative whitelist of terminal gRPC codes. Transient errors (Canceled / Unavailable / DeadlineExceeded / ResourceExhausted / Internal) and non-gRPC errors still retry, so the original fix for #9139 (rclone EIO burst during filer connection flaps) is preserved.	2026-04-20 00:31:37 -07:00
Chris Lu	8857dbfb74	fix(test): drop host port mapping from risingwave catalog test to kill TOCTOU flake The random host port allocated by MustFreeMiniPorts was released before docker run bound it, occasionally losing the race to another process and failing with "address already in use". The sidecar already reaches RisingWave via shared netns (--network container:...), so the host -p mapping and the corresponding WaitForPort check were unused.	2026-04-19 23:26:45 -07:00
Chris LuandGitHub	9a6b566fb1	fix(shell): volume.fsck keeps going past a single broken chunk manifest (#9140 ) * fix(shell): volume.fsck no longer aborts on a single broken chunk manifest Previously a single entry whose chunk-manifest could not be read (e.g. the manifest needle was missing or its sub-chunks pointed at a now-gone volume) caused collectFilerFileIdAndPaths to return immediately with "failed to ResolveChunkManifest". The whole fsck run failed, so an operator with even one corrupted file could not use volume.fsck to find or clean up unrelated orphan needles on other volumes — they had to locate and delete the bad entries first, blind, with no help from fsck. Log the resolution failure with the entry path, fall back to recording the top-level chunk fids the entry references (data fids and manifest fids themselves; sub-chunks behind the unresolvable manifest stay unknown), and keep traversing. Track the count of unresolved entries on the command struct and refuse -reallyDeleteFromVolume for the run when the count is non-zero, since the in-use fid set is incomplete and a purge could otherwise delete live sub-chunks behind the broken manifest. Read-only fsck still produces a useful (if conservatively over-reported) orphan listing so the operator can see and fix the broken entries first, then re-run with apply. Discovered while diagnosing #9116. * address review: use callback ctx and atomic counter - Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error cancellation propagates into the manifest fetch instead of using context.Background(). - TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go), so the unresolvedManifestEntries field on commandVolumeFsck is shared across workers and was racing. Switch it to atomic.Int64 with Add/Load. * address review: reset counter per Do(), pass through ctx errors - commandVolumeFsck is a singleton registered in init() and reused across shell invocations. Without resetting the unresolved-manifest counter at the top of Do(), a single failed run permanently suppressed -reallyDeleteFromVolume in the same shell session. Reset to 0 right after flag parsing. - Treating context cancellation as manifest corruption was wrong: a Ctrl+C or deadline mid-traversal would inflate the counter and emit misleading "manifest broken" warnings for entries that were never examined. Detect context.Canceled / context.DeadlineExceeded and return the error so the BFS unwinds cleanly. Not changing the findMissingChunksInFiler branch's purgeAbsent / applyPurging gating: that path checks recorded filer fids against volume idx files, and a broken-manifest entry's recorded manifest fid will fail the existence check and get purged — which is the cleanup the operator wants for those entries. Adding a gate would block the exact use case the warning points them at.	2026-04-19 23:06:28 -07:00
Chris Lu	3ff92f797d	4.21 4.21	2026-04-19 14:38:29 -07:00
Chris LuandGitHub	a8ba9d106e	peer chunk sharing 7/8: tryPeerRead read-path hook (#9136 ) * mount: batched announcer + pooled peer conns for mount-to-mount RPCs * peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded. * peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced. * WFS starts both alongside the gRPC server; stops them on unmount. * mount: wire tryPeerRead via FetchChunk streaming gRPC Replaces the HTTP GET byte-transfer path with a gRPC server-stream FetchChunk call. Same fall-through semantics: any failure drops through to entryChunkGroup.ReadDataAt, so reads never slow below status quo. * peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk (flattening manifests), asks the HRW owner for holders via ChunkLookup, then opens FetchChunk on each holder in LRU order (PR #5) until one succeeds. Assembled bytes are verified against FileChunk.ETag end-to-end — the peer is still treated as untrusted. Reuses the shared PeerConnPool from PR #6 for all outbound gRPC. * peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing itself on a self-owned fid. * filehandle_read.go: tryPeerRead slot between tryRDMARead and entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the presence of peerGrpcServer (the single identity test). Read ordering with the feature enabled is now: local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server One port, one identity, one connection pool — no more HTTP bytecast. * test(fuse_p2p): end-to-end CI test for peer chunk sharing Adds a FUSE-backed integration test that proves mount B can satisfy a read from mount A's chunk cache instead of the volume tier. Layout (modelled on test/fuse_dlm): test/fuse_p2p/framework_test.go — cluster harness (1 master, 1 volume, 1 filer, N mounts, all with -peer.enable) test/fuse_p2p/peer_chunk_sharing_test.go — writer-reader scenario The test (TestPeerChunkSharing_ReadersPullFromPeerCache): 1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner of a chunk is self ~50 % of the time (peer path short-circuits); with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly exercises the remote-owner fan-out. 2. Mount 0 writes a ~8 MiB file, then reads it back through its own FUSE to warm its chunk cache. 3. Waits for seed convergence (one full MountList refresh) plus an announcer flush cycle, so chunk-holder entries have reached each HRW owner. 4. Mount 1 reads the same file. 5. Verifies byte-for-byte equality AND greps mount 1's log for "peer read successful" — content matching alone is not proof (the volume fallback would also succeed), so the log marker is what distinguishes p2p from fallback. Workflow .github/workflows/fuse-p2p-integration.yml triggers on any change to mount/filer peer code, the p2p protos, or the test itself. Failure artifacts (server + mount logs) are uploaded for 3 days. Mounts run with -v=4 so the tryPeerRead success/failure glog messages land in the log file the test greps.	2026-04-19 00:53:12 -07:00
Chris LuandGitHub	73f10fa528	peer chunk sharing 6/8: announce queue + batched flush (#9135 ) mount: batched announcer + pooled peer conns for mount-to-mount RPCs * peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded. * peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced. * WFS starts both alongside the gRPC server; stops them on unmount.	2026-04-18 21:42:36 -07:00
Chris LuandGitHub	fe9ca35bbd	peer chunk sharing 5/8: mount chunk-directory shard (#9134 ) mount: tier-2 chunk directory + FetchChunk streaming on one gRPC port Collapses the old two-port design (HTTP peer-serve + separate gRPC directory) into a single gRPC service that handles every mount-to- mount exchange: ChunkAnnounce, ChunkLookup, and the new FetchChunk byte stream. * peer_directory.go: fid -> holders shard, HRW-gated; returns holders in LRU order; capacity-bounded; Sweep handles eviction under write-lock while Lookup runs under RLock (hot path is concurrent). * peer_grpc.go: single MountPeer gRPC server implementing all three RPCs. FetchChunk frames bytes at 1 MiB per Send so the default 4 MiB message cap does not constrain chunk size; cache miss returns gRPC NOT_FOUND so clients distinguish miss from transport error. Reuses pb.NewGrpcServer for consistent keepalive + msg-size tuning. * peer_bytepool.go: sync.Pool wrapper around []byte that the server uses to avoid a fresh 8 MiB allocation per FetchChunk call. WFS wiring starts the gRPC server on option.PeerListen (the single peer port) using the advertise address resolved in PR #3 as the HRW identity. A background sweeper evicts expired directory entries every 60 s.	2026-04-18 20:18:38 -07:00
Chris LuandGitHub	8a6348d3e9	peer chunk sharing 4/8: mount registrar + HRW owner selection (#9133 ) * proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design. * filer: add mount-server registry behind -peer.registry.enable Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList. * weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory. * weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless. * -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through. Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount. * filer: nil-safe peerRegistryEnable + registry hardening Addresses review feedback on PR #9131. * Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache. * Hardening (gemini review): registry now enforces three invariants: - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts) - TTL is capped at 1 hour so a runaway client cannot pin entries - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly Covered by new unit tests. * filer: rename -peer.registry.enable flag to -mount.p2p Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing. Flag renames (all three keep default=true, idle cost is near-zero): -peer.registry.enable -> -mount.p2p (weed filer) -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server) Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob. * filer: MountList returns DataCenter + List uses RLock Two review follow-ups on the mount peer registry: * weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality." * weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries). * mount: add peer chunk sharing options + advertise address resolver First cut at the peer chunk sharing wiring on the mount side. No functional behavior yet — this PR just introduces the option fields, the -peer.* flags, and the helper that resolves a reachable host:port from them. The server implementation arrives in PR #5 (gRPC service) and the fetcher in PR #7. * ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we use -peer.listen's bind host if specific; else util.DetectedHostAddress combined with the port. This is what gets registered with the filer and announced to peers, so wildcard binds no longer result in unreachable identities like "[::]:18080". * Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack. One port handles both directory RPCs and streaming chunk fetches (see PR #1 FetchChunk proto), so there is no second -peer.grpc.* flag — the old HTTP byte-transfer path is gone. * New flags on weed mount: -peer.enable, -peer.listen (default :18080), -peer.advertise (default auto), -peer.rack. * mount: register with filer and maintain HRW seed view Adds the mount-side tier-1 client. On startup the mount calls MountRegister with its advertise address (PR #3) and keeps both the filer entry and the local seed view fresh via background tickers (30 s register / 30 s list, 90 s filer TTL). * peer_hrw.go: pure rendezvous-hashing helper picking a single owner per fid via top-1 HRW. Adding or removing one seed moves only ~1/N fids. * peer_registrar.go: heartbeat + list poller. Seeds() returns the slice directly (no per-call copy) since listOnce atomically swaps; background RPCs bind their context to Stop() so unmount doesn't hang on a slow filer. * WFS wiring uses ResolvePeerAdvertiseAddr from PR #3 for the identity registered with the filer. No HTTP server, no second port — one reachable address represents the mount. * mount: broadcast MountRegister/MountList to every filer Previously the registrar called through wfs.WithFilerClient, which only reaches whichever filer the WFS filer-client session happens to be on. That meant two mounts pointing at different filers would never see each other: the filer mount registries are in-memory and per-filer (no filer-to-filer sync), so each mount's MountList only returned peers that had also registered through the same filer. This commit makes the registrar multi-filer aware: * NewPeerRegistrar now takes the full FilerAddresses slice and a per-filer dial function. The old single-filer peerFilerClient interface is gone. * registerOnce fans a MountRegister RPC out to every filer in parallel. Succeeds if at least one filer accepted — an unreachable filer is tolerated, logged, and retried on the next heartbeat. * listOnce polls every filer's MountList in parallel and merges the responses by peer_addr, keeping the newest LastSeenNs on duplicates. Mounts talking to different filers therefore converge once every filer has been polled once. The merged-list property is what lets a fleet of mounts spread across multiple filers still form a single HRW seed view. Each filer only ever sees the subset of mounts that heartbeat through it, but the registrar reconstructs the union client-side. New unit tests guard both properties: - RegisterBroadcastsToAllFilers: one registerOnce hits all N filers. - ListMergesAcrossFilers: mount-a on filer-1 and mount-b on filer-2 both appear in the merged seed set. - ListMergeKeepsNewestLastSeen: the same mount reported by two filers collapses to one entry with the freshest timestamp.	2026-04-18 20:03:45 -07:00
Chris LuandGitHub	af1e571297	peer chunk sharing 3/8: mount peer-serve HTTP endpoint (#9132 ) * proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design. * filer: add mount-server registry behind -peer.registry.enable Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList. * weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory. * weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless. * -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through. Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount. * filer: nil-safe peerRegistryEnable + registry hardening Addresses review feedback on PR #9131. * Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache. * Hardening (gemini review): registry now enforces three invariants: - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts) - TTL is capped at 1 hour so a runaway client cannot pin entries - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly Covered by new unit tests. * filer: rename -peer.registry.enable flag to -mount.p2p Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing. Flag renames (all three keep default=true, idle cost is near-zero): -peer.registry.enable -> -mount.p2p (weed filer) -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server) Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob. * filer: MountList returns DataCenter + List uses RLock Two review follow-ups on the mount peer registry: * weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality." * weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries). * mount: add peer chunk sharing options + advertise address resolver First cut at the peer chunk sharing wiring on the mount side. No functional behavior yet — this PR just introduces the option fields, the -peer.* flags, and the helper that resolves a reachable host:port from them. The server implementation arrives in PR #5 (gRPC service) and the fetcher in PR #7. * ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we use -peer.listen's bind host if specific; else util.DetectedHostAddress combined with the port. This is what gets registered with the filer and announced to peers, so wildcard binds no longer result in unreachable identities like "[::]:18080". * Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack. One port handles both directory RPCs and streaming chunk fetches (see PR #1 FetchChunk proto), so there is no second -peer.grpc.* flag — the old HTTP byte-transfer path is gone. * New flags on weed mount: -peer.enable, -peer.listen (default :18080), -peer.advertise (default auto), -peer.rack.	2026-04-18 20:03:34 -07:00
Chris LuandGitHub	e24a443b17	peer chunk sharing 2/8: filer mount registry (#9131 ) * proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design. * filer: add mount-server registry behind -peer.registry.enable Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList. * weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory. * weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless. * -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through. Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount. * filer: nil-safe peerRegistryEnable + registry hardening Addresses review feedback on PR #9131. * Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache. * Hardening (gemini review): registry now enforces three invariants: - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts) - TTL is capped at 1 hour so a runaway client cannot pin entries - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly Covered by new unit tests. * filer: rename -peer.registry.enable flag to -mount.p2p Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing. Flag renames (all three keep default=true, idle cost is near-zero): -peer.registry.enable -> -mount.p2p (weed filer) -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server) Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob. * filer: MountList returns DataCenter + List uses RLock Two review follow-ups on the mount peer registry: * weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality." * weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).	2026-04-18 20:03:23 -07:00
Chris LuandGitHub	d7d834b8f9	peer chunk sharing 1/8: proto definitions (#9130 ) proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design.	2026-04-18 20:02:55 -07:00
Chris LuandGitHub	6787a4b4e8	fix kafka gateway and consumer group e2e flakes (#9129 ) * fix(test): reduce kafka gateway and consumer group flakes * fix(kafka): make broker health-check backoff respect context Replace time.Sleep in the retry loop with a select on bc.ctx.Done() and time.After so the backoff is interruptible during shutdown, per review feedback on PR #9129. * fix(kafka): guard broker HealthCheck against nil client Return the same "broker client not connected" error used by the other exported BrokerClient methods instead of panicking on a partially initialized client, per CodeRabbit review feedback on PR #9129.	2026-04-18 10:03:24 -07:00
Chris Lu	6832b9945b	ci(s3tests): install libxml2/libxslt dev headers before pip install ceph/s3-tests pins lxml without an upper bound. When pip picks a release whose prebuilt wheel isn't published for Python 3.9 on the runner, it falls back to sdist and fails without libxml2-dev / libxslt1-dev.	2026-04-18 01:06:39 -07:00
Chris LuandGitHub	1c130c2d47	fix(mount): close inodeLocks cleanup race that let two flock holders coexist (#9128 ) * fix(mount): close inodeLocks cleanup race that allowed two flock holders PosixLockTable.getOrCreateInodeLocks released plt.mu before the caller acquired il.mu. A concurrent maybeCleanupInode could delete the map entry in that window; the first caller would then insert its lock into the orphaned inodeLocks while a later caller created a fresh entry in the map, so findConflict never observed the orphaned lock and two owners could simultaneously believe they held the same exclusive flock. This matches the flaky CI failure seen in TestPosixFileLocking/ConcurrentLockContention: Error: Should be empty, but was [worker N: flock overlap detected with 2 holders] Mark removed inodeLocks as dead under plt.mu+il.mu, and have SetLk / SetLkw recheck the flag after locking il.mu, refetching the live entry from the map when orphaned. Also delete the map entry only if it still points to this il, so a racing recreate is not clobbered. Adds TestConcurrentFlockChurnPreservesMutualExclusion: 16 goroutines x 500 flock/unflock iterations on one inode. Reliably reports 500+ overlaps per run before the fix; clean across 100 race-enabled runs after. * fix(mount): extend dead-flag contract to GetLk and self-heal primitives Address review feedback on the initial cleanup-race fix: 1. GetLk had the same stale-pointer bug as SetLk. A caller could grab an inodeLocks pointer, have cleanup orphan it and a replacement il receive a conflicting lock, then answer F_UNLCK off the empty dead pointer. Add the same dead recheck + refetch loop. 2. getOrCreateInodeLocks and getInodeLocks now treat a dead map entry as defective: the former replaces it with a fresh inodeLocks, the latter drops it and returns nil. Production cannot reach that state (maybeCleanupInode atomically deletes under plt.mu when it sets dead), but the hardening guarantees the SetLk / SetLkw / GetLk retry loops always make progress even if a future refactor reorders those operations, and it lets the white-box tests set up a stale dead entry without spinning. 3. Strengthen the regression suite: - TestSetLkRetriesPastDeadInodeLocks: deterministic white-box test that installs a dead il in the map and asserts SetLk routes the new lock into a fresh il (not the orphan), that GetLk reports the resulting conflict, and that a different-owner acquire is rejected with EAGAIN. - TestGetInodeLocksEvictsDeadEntry: verifies both map-read primitives drop or replace dead entries. - TestConcurrentFlockChurnPreservesMutualExclusion: replace the timing-fragile Add(1)-and-check counter with a Swap+CAS detector. Each worker claims a slot after SetLk OK and releases it before UN, flagging both an observed predecessor and a lost CAS on release. Against a reverted fix the detector fires 1000+ times per run; with the fix clean across 100 race-enabled iterations. * test(mount): fail fast on unexpected SetLk statuses in churn loop The stress test blindly spun on any non-OK SetLk status and discarded the unlock return. If SetLk ever returns something other than OK or EAGAIN (e.g. after a future refactor introduces a new error), the acquire loop would spin forever and an unlock failure would be silently swallowed. Capture the acquire status, retry only on the expected EAGAIN, and assert unlock returns OK. Use t.Errorf + return (not t.Fatalf) because the checks run on worker goroutines where FailNow is unsafe. The Swap+CAS overlap detector is unchanged.	2026-04-17 23:12:26 -07:00
Chris LuandGitHub	c1ccbe97dd	feat(filer.backup): -initialSnapshot seeds destination from live tree (#9126 ) * feat(filer.backup): -initialSnapshot seeds destination from live tree Replaying the metadata event log on a fresh sync only leaves files that still exist on the source at replay time: any entry that was created and later deleted is replayed as a create/delete pair and never materializes on the destination. Users who wipe the destination and re-run filer.backup therefore see "only new files" instead of a full backup, even when -timeAgo=876000h is passed and the subscription genuinely starts from epoch (ref discussion #8672). Add a -initialSnapshot opt-in flag: when set on a fresh sync (no prior checkpoint, -timeAgo unset), walk the live filer tree under -filerPath via TraverseBfs and seed the destination through sink.CreateEntry, then persist the walk-start timestamp as the checkpoint and subscribe from there. Capturing the timestamp before the walk lets the subscription catch any create/update/delete racing with the walk — sink CreateEntry is idempotent across the builtin sinks so replay is safe. Honors existing -filerExcludePaths / -filerExcludeFileNames / -filerExcludePathPatterns filters and skips /topics/.system/log the same way the subscription path does. Also log "starting from <t> (no prior checkpoint)" instead of a misleading "resuming from 1970-01-01" when the KV has no stored offset. * fix(filer.backup): guard initialSnapshot counters under TraverseBfs workers TraverseBfs fans the callback out across 5 worker goroutines, so the entryCount / byteCount updates and the 5-second progress-log gate in runInitialSnapshot were racing. Switch the counters to atomic.Int64 and protect the lastLog check/update with a short-scoped mutex so the heavy sink.CreateEntry call stays outside the critical section. Flagged by gemini-code-assist on #9126; verified with go test -race. * fix(filer.backup): harden initialSnapshot against transient errors and path edge cases Three review items from CodeRabbit on #9126: 1. getOffset errors no longer leave isFreshSync=true. Before, a transient KV read failure would cause runFilerBackup's retry loop to redo the full -initialSnapshot walk on every retry. Treat any offset-read error as "not fresh" so the snapshot only runs when we've verified there really is no prior checkpoint. 2. initialSnapshotTargetKey now normalizes sourcePath to a trailing- slash base before stripping the prefix, so edge cases where sourceKey equals sourcePath (trailing-slash mismatch or root-entry emission) no longer index past the end. Unit tests cover both forms. 3. Documented the TraverseBfs-enumerates-excluded-subtrees performance characteristic on runInitialSnapshot, since pruning requires a separate change to TraverseBfs itself. * fix(filer.backup): retry setOffset after initialSnapshot to avoid full re-walks If the snapshot walk finishes but the subsequent setOffset fails, the retry loop in runFilerBackup will re-enter doFilerBackup with an empty checkpoint and run the full BFS again — on a multi-million-entry tree that's hours of wasted work over a 100-byte KV write. Retry the write a handful of times with exponential backoff before giving up, and log loudly at the final failure (with snapshotTsNs + sinkId) so operators recognize the symptom instead of guessing at mysterious repeated walks. Nitpick raised by CodeRabbit on #9126. * fix(filer.backup): initialSnapshot ignore404, skew margin, exclude dir-entry itself Three review items from CodeRabbit on #9126: 1. ignore404Error now threads into runInitialSnapshot. If a file is listed by TraverseBfs and then deleted before CreateEntry reads its chunks, the follow path already ignores 404s — the snapshot path was aborting and triggering a full re-walk. Treat an ignorable 404 as "skip this entry, continue." 2. snapshotTsNs now uses `time.Now() - 1min` instead of `time.Now()`. Metadata events are stamped server-side, so a fast backup-host clock could skip events that fire during or right after the walk. Matches the 1-minute margin meta_aggregator.go applies on initial peer traversal; duplicate replay is harmless because CreateEntry is idempotent. 3. Exclude checks now run against the entry's own full path, not just its parent. A walked directory whose full path matches SystemLogDir or -filerExcludePaths was being seeded to the destination; only its descendants were being skipped. Verified with a manual repro where -filerExcludePaths=/data/skipdir now keeps the skipdir entry itself off the destination. * refactor(filer): share destKey helper between buildKey and initialSnapshot Extract destKey(dataSink, targetPath, sourcePath, sourceKey, mTime) from buildKey in filer_sync.go. Both the event-log path (buildKey) and the initialSnapshot walk (initialSnapshotTargetKey) now go through the same helper, so a walk-seeded file and an event-replayed file always resolve to the same destination key. As a bonus, buildKey picks up the defensive trailing-slash normalization that initialSnapshotTargetKey introduced — no more index-past-end risk when sourceKey happens to equal sourcePath. Also tightens the mTime lookup to guard against nil Attributes (caught by an existing test against buildKey when I first moved the lookup out of the incremental branch).	2026-04-17 21:21:32 -07:00
Chris LuandGitHub	d57fc67022	fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127 ) * fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was true, printing "Change volume id for large file is not implemented yet" and continuing. Because the BFS traversal only looks at top-level entry.Chunks, sub-chunks referenced inside a manifest were never considered either. For any file stored as a chunk manifest (large files go this path), chunks in the source volume stayed put, leaving behind a few MB of live data that vacuum and volume.deleteEmpty couldn't clean up. This change resolves each manifest chunk recursively, moves any sub-chunk whose volume id is in the merge plan via the existing moveChunk path, and re-serializes the manifest. If the manifest chunk itself lives in a source volume, or any sub-chunk moved, the new manifest blob is uploaded to a freshly assigned file id (the old needle becomes orphaned and is reclaimed by vacuum like any other moved chunk). Fixes #9116. * address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes - Call UpdateEntry once per entry after the chunk loop instead of once per moved chunk (gemini nit). - In dry-run mode, mark anySubChanged when a sub-chunk in the plan is encountered and return changed=true after printing "rewrite manifest", so nested manifests also surface their would-rewrites (gemini nit). - Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is restored even when proto.Marshal fails (coderabbit nit). - Reject AssignVolume results whose file id lands on a volume that is a source in the merge plan, and retry — otherwise the replacement manifest could be written to the volume being emptied (coderabbit).	2026-04-17 21:17:51 -07:00
Jaehoon Kim GitHub Chris Lu	96af27a131	feat(shell): add fs.distributeChunks command for even chunk distribution (#9117 ) * feat(shell): add fs.distributeChunks command for even chunk distribution Add a new weed shell command that redistributes a file's chunks evenly across volume server nodes. Supports three distribution modes via -mode flag: - primary: balance chunk ownership across nodes (default) - replica: balance both ownership and replica copies - round-robin: assign chunks by offset order for sequential read optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...) Additional options: - -nodes=N to target specific number of nodes - -apply to execute (dry-run by default) Usage: fs.distributeChunks -path=/buckets/file.dat fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply * fix(shell): improve fs.distributeChunks robustness and code quality - Propagate flag parse errors instead of swallowing them (return err) - Handle nil chunk.Fid by falling back to legacy FileId string parsing - Simplify node membership check using slices.Contains * fix(shell): fix dead round-robin print loop in fs.distributeChunks The loop was computing targetNode with sc.index%totalNodes (original chunk index) instead of the sequential position, and discarding it via _ = targetNode without printing anything. Replace with a correct loop using pos%totalNodes and actually print the first 12 node assignments. * fix(shell): compute replication/collection per-chunk in fs.distributeChunks Previously replication and collection were derived once from chunks[0] and reused for all moves, causing wrong volume placement for chunks belonging to different volumes or collections. Now each chunk looks up its own volumeInfoMap entry immediately before calling operation.Assign. * fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks When the master returns an Auth token in the Assign response, use it directly for the upload instead of generating a new JWT from the local viper signing key. Fall back to local key generation only when Auth is empty, matching the pattern used by other upload paths. * fix(shell): add timeout and error handling to delete requests in fs.distributeChunks The delete loop was ignoring http.NewRequest errors and had no timeout, risking a nil-request panic or indefinite block. Replace with http.NewRequestWithContext and a 30s timeout, handle request creation errors by incrementing deleteFailCount, and cancel the context immediately after Do returns. * feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup Sequential chunk moves are a bottleneck for large LLM model files with hundreds or thousands of chunks. Use ErrorWaitGroup with DefaultMaxParallelization (10) to run download/assign/upload concurrently. Guard movedRecords appends, chunk.Fid updates, and writer output with a mutex. Individual chunk failures are non-fatal and logged inline; only successfully moved chunks are included in the metadata update. * fix(shell): try all replica URLs on download in fs.distributeChunks Previously only the first volume server URL was attempted, causing chunk moves to fail if that replica was unreachable. Now iterates through all URLs returned by LookupVolumeServerUrl and stops at the first success. * refactor(shell): apply extract method pattern to fs.distributeChunks Do() was a single ~615-line function. Break it into focused helpers: - lookupFileEntry: filer entry lookup - validateChunks: chunk manifest guard - collectVolumeTopology: master topology query + ownership mapping - buildDistributionCounts: chunk→node mapping and owner/copy tallies - selectActiveNodes: target node selection - printCurrentDistribution: per-node distribution table - planDistribution: mode-switch planning (primary/replica/round-robin) - printRedistributionPlan: before/after plan table - relevantNodes: active-or-occupied node filter Do() is now ~100 lines of orchestration; each helper has a single clear responsibility. * test(shell): add unit tests for fs.distributeChunks algorithms Cover all three distribution modes and supporting helpers: - shortName, relevantNodes - computeOwnerTarget (even/uneven split, inactive node drain) - buildDistributionCounts (normal + nil Fid fallback) - selectActiveNodes (all nodes / limited count) - planOwnerMoves (imbalanced → balanced, already balanced) - planDistribution primary (chunks balanced, no-op when even) - planDistribution round-robin (offset ordering, correct assignment) - planDistribution replica (owner + copy balancing) - printRedistributionPlan (output format) * fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks Download requests had no per-request timeout, unlike delete operations which already use 30s. Replace readUrl() calls with inline http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume server cannot block a goroutine indefinitely during redistribution. * fix(shell): remove redundant deleteOldChunks in fs.distributeChunks filer.UpdateEntry already calls deleteChunksIfNotNew internally, which computes the diff between old and new entry chunks and deletes the ones no longer referenced. Our explicit deleteOldChunks was racing with this filer-side cleanup, causing spurious 404 warnings on ~75% of deletes. Remove deleteOldChunks, movedChunkRecord type, and reduce executeChunkMoves return type to (int, error) for the moved count. * fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with a FileId string instead of a Fid struct, causing them to be skipped in the replica balancing loop and looked up incorrectly in volumeInfoMap. Introduce chunkVolumeId() that uses Fid when present and falls back to parsing the legacy FileId string, matching the logic in buildDistributionCounts. Apply it in the replica-mode copies loop and in executeChunkMoves' replication/collection lookup. * fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks chunkVolumeId(chunk) was being called to look up replication/collection after oldFid had already been parsed and validated. Use oldFid.VolumeId directly to avoid redundant parsing and guarantee the correct volume ID regardless of whether chunk.Fid is nil. * fix(shell): improve correctness and robustness in fs.distributeChunks - Buffer download body before upload so dlCtx timeout only covers the GET request; upload runs with context.Background() via bytes.NewReader - Replace 'before, after := strings.Cut(...)' + '_ = before' with '_' as the first return value directly - Clone copiesCount before replica planner mutates it, keeping the caller's map immutable - Add nil-entry guard after filer LookupEntry to prevent panic on unexpected nil response * feat(shell): support chunk manifests in fs.distributeChunks Large files stored as chunk manifests were previously rejected. Resolve manifests up front via filer.ResolveChunkManifest, redistribute the underlying data chunks, then re-pack through filer.MaybeManifestize before UpdateEntry. The filer's MinusChunks resolves manifests on both sides of the diff, so old manifest and inner data chunks are GC'd automatically. * fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-17 21:09:36 -07:00
Lisandro Pin GitHub Lisandro Pin	6bcacedda9	Export `master_disconnections` metrics on volume servers. (#9104 ) This allows to track connection issues and master failovers in real time via Prometheus metrics. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-17 15:15:26 -07:00
Chris LuandGitHub	0315da9022	fix(s3api): self-heal stale .versions latest-version pointer on read (#9125 ) * fix(s3api): self-heal stale .versions latest-version pointer on read When the `.versions` directory metadata points at a version file that has gone missing (e.g. a crash between deleting the latest version and rewriting the pointer, or a concurrent delete racing with a read), `getLatestObjectVersion` bailed with a hard error that required manual repair. Tagging, ACL, retention, copy-source, and HEAD/GET all surfaced NoSuchKey even though other versions remained on disk. On `ErrNotFound`/`codes.NotFound` from the pointed-at version lookup, rescan the `.versions` directory, pick the newest remaining non-delete- marker entry, persist the repaired pointer (best-effort), and return that entry. If only delete markers (or nothing) remain, the caller still sees an error and the object correctly appears absent. Extracted the selection logic into a pure `selectLatestContentVersion` helper so `updateLatestVersionAfterDeletion` and the new self-heal path share a single implementation. A warning is logged whenever the heal kicks in so stale-pointer incidents remain visible in operator logs. * fix(s3api): self-heal must promote newest version even if delete marker Review feedback (gemini-code-assist): the self-heal path used `selectLatestContentVersion`, which skips delete markers. That had two bugs: 1. If the chronologically newest entry was a delete marker, an older content version would be promoted, effectively "undeleting" an object that was actually deleted. 2. If only delete markers remained, heal returned an error and the caller surfaced a hard 500 instead of the correct 404-with- x-amz-delete-marker response. Add `selectLatestVersion` that picks the newest entry regardless of type (content or delete marker) and use it in `healStaleLatestVersionPointer`. The promoted entry flows back through `doGetLatestObjectVersion` unchanged; downstream handlers already detect `ExtDeleteMarkerKey` on the returned entry and render NoSuchKey + `x-amz-delete-marker: true` (see `s3api_object_handlers.go:722-728`). Kept `selectLatestContentVersion` in place for `updateLatestVersionAfterDeletion`, which deliberately limits the pointer to a live content version in the post-deletion flow — changing that is out of scope for this fix. Added four tests for the new selector: - promotes newest delete marker over older content (the reviewer case) - picks content when content is newest - promotes newest delete marker when only delete markers remain - returns nil latestEntry on empty/untagged input * fix(s3api): paginate .versions scan in self-heal path Review feedback (gemini-code-assist, coderabbitai): the self-heal rescan did a single-shot list(..., 1000). For objects whose version ids use the old raw-timestamp format, filer ordering is lexicographic-ascending = oldest-first. If the .versions directory held more than one page of entries, the first page contained only the oldest, and the heal would promote an older version as the new "latest" — silently surfacing stale data on subsequent reads. Paginate through the whole directory with `filer.PaginationSize`, running `selectLatestVersion` per page and keeping a single running best candidate across pages (via `compareVersionIds`). This mirrors the pagination pattern already used by `getObjectVersionList` in the same file and closes the window for old-format stacks larger than one page. New-format (inverted-timestamp) ids were not affected because their lexicographic order matches newest-first, but paginating is still the right fix. Also updated the function doc to reflect that self-heal now promotes the newest entry regardless of type (content version or delete marker). * fix(s3api): don't resurrect deleted objects; wrap ErrNotFound sentinel Two review findings addressed: 1. `updateLatestVersionAfterDeletion` was using `selectLatestContentVersion` which skipped delete markers. Scenario: PUT v1, DELETE (dm1 written, pointer->dm1), PUT v3 (pointer->v3), user explicitly deletes v3 by versionId. Remaining files: v1, dm1. S3 semantics say the current version is the chronologically newest = dm1 (object appears deleted). Old code would promote v1, "undeleting" the object silently. Switched to `selectLatestVersion`, which picks the newest entry regardless of type. Also paginated the scan the same way the self-heal path does — otherwise a single-page `list(1000)` still mis-selects the latest for old-format version id stacks that exceed one page. Removed the `hasDeleteMarkers \|\| !isLast` branch: with `selectLatestVersion` any delete marker participates in the comparison and shows up as the latest when it is newest, so the "keep the directory on ambiguity" guard becomes unreachable. Full pagination also makes the `!isLast` guard unnecessary — we either see every entry or surface a list error. 2. `healStaleLatestVersionPointer`'s "no remaining version" error was a plain `fmt.Errorf`, so callers could not distinguish genuine object-absence from a scan failure via `errors.Is`. Wrap it with `filer_pb.ErrNotFound` so the sentinel flows through the chain (the outer wrap already uses `%w`). No test additions needed — `TestSelectLatestVersion_PromotesNewestDeleteMarker` already asserts the "newer delete marker beats older content" invariant that drives both fixes. * refactor(s3api): drop unused selectLatestContentVersion after review Review feedback flagged a stale comment claiming selectLatestContentVersion "mirrors" the post-deletion semantics of updateLatestVersionAfterDeletion. That claim became false when the post-deletion path switched to selectLatestVersion in the previous commit. Verified no production callers remain — only the helper's own tests referenced it — so the cleaner fix is to delete the dead code rather than rewrite the comment to explain "this exists but isn't used." - Removed selectLatestContentVersion. - Removed the four TestSelectLatestContentVersion_* tests. - Preserved a renamed TestSelectLatestVersion_MixedFormats so the mixed-format comparator coverage still runs against the active selector.	2026-04-17 14:57:59 -07:00
Chris LuandGitHub	bfea14320a	fix(s3): reject unknown POST policy conditions and extra x-amz form fields (#9124 ) * fix(s3): reject unknown POST policy conditions and extra x-amz form fields CheckPostPolicy previously accepted policy conditions with unknown $keys (e.g. "$foo") as satisfied, and only rejected stray X-Amz-Meta-* form fields. Reject unknown condition keys outright, and extend the extra- input-fields check to all X-Amz-* form fields except the reserved auth/signing headers. Matches AWS S3 POST Object behavior. * refactor(s3): drop redundant $x-amz-meta- prefix check in CheckPostPolicy The $x-amz- prefix already subsumes $x-amz-meta-, so the explicit $x-amz-meta- check adds no coverage. Simplify the else-if condition. Addresses gemini-code-assist review on PR #9124. * style(s3): align unknown-key policy error with [op, key, value] trailer Reformat the unknown-condition-key error in CheckPostPolicy to include the same "[op, key, value]" trailer used by the other condition-failed messages. The value slot is empty because no comparison occurs for an unknown key. The descriptive "unknown condition key" suffix is kept so operators can still tell this failure from a mismatched value. * fix(s3): honor starts-with prefix-stem POST policies when checking extras AWS POST policies use ["starts-with","$x-amz-meta-",""] to allow any X-Amz-Meta-* form field. The previous exact-match policyXAmzKeys would flag every X-Amz-Meta-Foo as an "Extra input fields" failure because only the stem X-Amz-Meta- was stored. Track starts-with conditions whose key ends in "-" with an empty value as prefix stems, and accept any X-Amz-* form field matching one of those stems. * fix(s3): validate value prefix for starts-with POST policy stems Drop the policy.Value == "" gate when detecting prefix-stem conditions so that ["starts-with","$x-amz-meta-","pfx-"] is recognized as a prefix rule. Track the required value prefix alongside the name prefix, enforce it against every matching form field in the extras loop, and skip the prefix-stem condition in the main iteration (it has no single form field to evaluate). Also include policy.Value in the unknown-condition error trailer for clearer debugging. Addresses gemini-code-assist review on PR #9124. * fix(s3): check every matching POST policy rule, not just the first The extras loop exited early on exact-key match and broke on the first matching prefix stem. Per AWS, a form field must satisfy every policy condition that applies to it, so an exact-match field must still honor any overlapping starts-with stem's value prefix, and multiple stems on the same field must all hold. Drop both early exits: start matched from the exact-key lookup, iterate all prefix stems, and fail on the first value-prefix violation. Addresses gemini-code-assist review on PR #9124.	2026-04-17 14:55:12 -07:00
Chris LuandGitHub	0bddc2652e	fix(s3): propagate validated POST form fields to upload headers (#9123 ) * fix(s3): propagate validated POST form fields to upload headers POST Object form fields like acl, Content-Encoding, x-amz-storage-class, x-amz-tagging, and x-amz-server-side-encryption were validated against the POST policy but never forwarded to the underlying PUT, so the validated values had no effect. Forward all non-reserved x-amz-* fields plus acl (as x-amz-acl) and Content-Encoding. Reserved POST policy mechanism fields (Policy, Signature, Key, etc.) are still excluded. * fix(s3): also forward Content-Language from POST form to upload headers AWS S3 POST Object supports Content-Language; add it to the set of content headers forwarded by applyPostPolicyFormHeaders alongside Cache-Control, Expires, Content-Disposition, and Content-Encoding. Addresses gemini-code-assist review on PR #9123. * refactor(s3): only look up form value in branches that use it applyPostPolicyFormHeaders previously called formValues.Get(k) for every form field, including fields that fall through the switch (non-reserved fields that are neither Acl, a forwarded content header, nor X-Amz-*). Move the lookup inside the switch cases that actually use it.	2026-04-17 14:55:06 -07:00
Chris LuandGitHub	2da24cc230	fix(s3): return 403 on POST policy violation instead of 307 redirect (#9122 ) * fix(s3): return 403 on POST policy violation instead of 307 redirect CheckPostPolicy failures previously responded with HTTP 307 Temporary Redirect to the request URL, which causes clients to re-POST and obscures the failure. Return 403 AccessDenied so the client surfaces the error. * test(s3): exercise PostPolicyBucketHandler end-to-end for 403 mapping Replace the shallow ErrAccessDenied tautology test with one that builds a signed POST multipart request whose policy conditions cannot be satisfied, calls PostPolicyBucketHandler directly, and asserts HTTP 403 with no Location redirect header. Addresses gemini-code-assist review on PR #9122. * fix(s3): surface POST policy failure reason in AccessDenied response Add s3err.WriteErrorResponseWithMessage so a caller can keep the standard error code mapping while providing a specific Message. Use it from PostPolicyBucketHandler so the XML body carries the CheckPostPolicy error (e.g. which condition failed or that the policy expired) rather than the generic "Access Denied." description. Addresses gemini-code- assist review on PR #9122. * refactor(s3err): delegate WriteErrorResponse to WriteErrorResponseWithMessage The two helpers shared every line except the Message override. Fold WriteErrorResponse into a one-line delegation that passes an empty message, so the request-id/mux/apiError logic lives in exactly one place. Addresses gemini-code-assist review on PR #9122.	2026-04-17 14:54:58 -07:00
Chris Lu	32cbed9658	fix(fuse-tests): avoid ephemeral port reuse racing weed mini bind freePort allocated in [20000, 55535], which overlaps the Linux ephemeral range (32768-60999). The kernel could reuse the chosen port for an outbound connection between the test closing its listener and weed mini re-checking availability, causing: Port allocation failed: port N for Filer (specified by flag filer.port) is not available on 0.0.0.0 and cannot be used Narrow the range to [20000, 32000] to stay below the ephemeral floor, and pass -ip.bind=127.0.0.1 so mini's pre-check runs on the same IP the test actually reserved the port on.	2026-04-17 13:47:17 -07:00
Chris LuandGitHub	cce98fcecf	fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly (#9120 ) * fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly AuthSignatureOnly is the only auth gate in front of S3Tables routes (incl. CreateTableBucket) and UnifiedPostHandler, but unlike authenticateRequestInternal it did not clear the internal IAM trust headers before running signature verification. S3Tables authorizeIAMAction reads X-SeaweedFS-Principal directly from the request and prefers it over the authenticated identity's PrincipalArn, so a signed low-privilege caller could append that header after signing (unsigned header, SigV4 still verifies) and have IAM policy evaluated against a spoofed principal, bypassing authorization. Clear both X-SeaweedFS-Principal and X-SeaweedFS-Session-Token at the top of AuthSignatureOnly, mirroring the existing guard in authenticateRequestInternal. Add a regression test covering the header-injection path. * refactor(s3): route AuthSignatureOnly through authenticateRequestInternal Addresses review feedback: both entry points were independently maintaining the internal-IAM-header stripping and the auth-type dispatch switch. Collapse AuthSignatureOnly into a thin wrapper around authenticateRequestInternal so the security-critical header scrub and the signature-verify switch live in one place. Post-auth behavior unique to AuthSignatureOnly (AmzAccountId header) stays inline. No functional change beyond two harmless telemetry tweaks that now match authenticateRequestInternal: the per-branch glog verbosity shifts from V(3) to V(4), and the anonymous-found path now sets AmzAuthType. * refactor(s3): centralize X-SeaweedFS-Principal/Session-Token header names Introduce SeaweedFSPrincipalHeader and SeaweedFSSessionTokenHeader in weed/s3api/s3_constants so the trust-header literals are defined once and referenced consistently by the auth scrub, JWT auth path, bucket policy principal resolution, IAM authorization, and S3Tables IAM evaluation. Replace every remaining usage in weed/s3api and weed/s3api/s3tables. This removes the drift risk the reviewer called out: adding another call site with a typo can no longer silently bypass the scrub. Pure rename, no behavior change. No-op integration-test helper in test/s3/iam/s3_iam_framework.go left untouched (separate module, and the server now strips the client-supplied value regardless).	2026-04-17 12:23:21 -07:00
Chris LuandGitHub	88ac2d0431	security(s3api): reject unsigned x-amz-* headers in SigV4 requests (#9121 ) A presigned URL holder could attach arbitrary x-amz-* headers to a PUT request (e.g. x-amz-tagging, x-amz-acl, x-amz-storage-class, x-amz-server-side-encryption, x-amz-object-lock-, x-amz-meta-, x-amz-website-redirect-location, x-amz-grant-). Because only the headers declared in SignedHeaders participate in signature verification, the added headers bypass authentication; the PUT handler then persists them into the object's Extended metadata. Match the AWS SigV4 rule: every x-amz-* header present in the request must appear in SignedHeaders. Exempt x-amz-content-sha256 (already tamper-protected via the canonical request's payload-hash line) and, for presigned URLs, the SigV4 protocol parameters that live in the query string (X-Amz-Algorithm/Credential/Date/Expires/SignedHeaders/ Signature) in case they are duplicated as headers. Applies to both header-based and presigned SigV4; non-amz headers are unaffected.	2026-04-17 12:20:28 -07:00
Chris LuandGitHub	e8767f42b6	Add security policy for vulnerability reporting	2026-04-17 09:51:21 -07:00
Chris LuandGitHub	f720f559cb	ci(kafka-loadtest): switch off Ubuntu/Debian base images to avoid apt mirror flakes (#9119 ) * ci(kafka-loadtest): retry apt-get to survive Ubuntu mirror flakes The Kafka Quick Test workflow's Docker build of Dockerfile.loadtest keeps hitting "Connection failed [IP: ...]" on archive.ubuntu.com / security.ubuntu.com mid-build, e.g.: failed to solve: process "/bin/sh -c apt-get update && \ apt-get install -y ca-certificates curl jq bash netcat ..." did not complete successfully: exit code: 100 Same class of failure PR #9106 fixed for pjdfstest. Apply the same two retry knobs to Dockerfile.loadtest and Dockerfile.seektest so a transient mirror flake retries five times with a 30s timeout instead of failing the whole workflow. (The fuller pjdfstest fix also restructured the build to use docker/build-push-action with type=gha cache. Doing that for kafka-client-loadtest would mean rewriting the make/docker-compose build path; defer until the apt-retry alone proves insufficient.) * ci(kafka-loadtest): also drop recommends/suggests + apt-get clean Address PR review (gemini-code-assist): fully align with PR #9106's pattern by adding --no-install-recommends / --no-install-suggests so the runtime images stay small and don't pull in extra packages, plus apt-get clean before rm -rf /var/lib/apt/lists/* in Dockerfile.seektest. * ci(kafka-loadtest): use alpine / maven base images instead of apt The previous rounds of apt-retry / apt-clean knobs aren't enough: the Ubuntu mirror is persistently unreachable from the GitHub runner for minutes at a time, which blows past the 5-retry / 30-second Acquire configuration and still kills the build (see run 24551809614). Switch both runtime images so no apt fetch is needed at all: - Dockerfile.loadtest now runs on alpine:3.20. All runtime deps (ca-certificates, curl, jq, bash, netcat-openbsd) are in the Alpine main repo, fetched from Alpine's CDN rather than the Ubuntu archive that keeps going dark. - Dockerfile.seektest now uses maven:3.9-eclipse-temurin-11, which ships JDK 11 and Maven preinstalled — no apt-get maven step. This also means the runtime images no longer care about Acquire::Retries / DEBIAN_FRONTEND / apt-get clean, so those lines are removed with the apt call they were configuring.	2026-04-17 07:43:19 -07:00
Chris LuandGitHub	45578a42e9	fix(volume): keep vacuum running past dangling .idx entries (#9115 ) * fix(volume): keep vacuum running past dangling .idx entries Vacuum compaction aborted entirely on the first .idx entry whose offset pointed past the end of the .dat file, surfacing as `cannot hydrate needle from file: EOF` and stalling progress on every other volume. In both Go and Rust: - During compaction, skip an unreadable needle and continue. The bytes it pointed at were already unreachable via reads, so dropping the index reference makes the post-vacuum volume consistent. Real EIO still bails out so a disk fault is not silently papered over. - At volume load, do a single linear scan of the .idx and confirm every (offset + actual size) fits inside .dat. The pre-existing integrity check only looked at the last 10 entries, so deeper corruption (e.g. left over from a crashed batched write) went undetected and only surfaced later as a vacuum EOF. A failure now marks the volume read-only at load time so an operator can react. Refs #8928 * fix(volume): only skip permanent-corruption needle reads during vacuum Address PR review feedback (gemini-code-assist + coderabbit): The original patch skipped any non-EIO read failure, which would silently drop needles on transient errors — Windows hardware bad-sector errors (ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network timeouts and EROFS would also slip through and shrink the volume. Switch to an explicit whitelist of permanent-corruption shapes: - Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of range" errors with %w so callers can match via errors.Is. - copyDataBasedOnIndexFile now skips only when the read failure is io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid, or ErrorCorrupted. Anything else (real disk faults, environmental errors, Windows hardware codes) aborts the compaction so an operator notices. - Mirror the same whitelist in the Rust volume server, matching on io::ErrorKind::UnexpectedEof and the NeedleError corruption variants (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort). Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows t.TempDir() cleanup can release the .dat/.idx handles. Refs #8928 * fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch Address PR review: the fallback branch in ReadBytes returned an unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap with %w so the whitelist applies, while leaving the existing direct sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry still triggers. Refs #8928 * fix(volume): integrate dangling-idx check into existing index load walk Address PR review (gemini-code-assist, medium): the structural .idx check used to do a second linear scan of the index file at every volume load, doubling the disk-I/O cost on servers managing many volumes. Track the largest (offset + actual size) seen during the existing needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`, `NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`, `DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`, exposed as `MaxNeedleEnd()` on the NeedleMapper interface. `volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size after the load is complete — pure numeric comparison, no extra I/O. The standalone `verifyIndexFitsInDat` helper and its caller in `CheckVolumeDataIntegrity` are removed; the test that used to drive the helper directly now exercises the new path via `LoadCompactNeedleMap`. Mirror the same change in the Rust volume server: track `max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()` on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum. The Rust load walk already happens in `load_from_idx` for both map kinds, so the structural check becomes free. Refs #8928	2026-04-16 22:01:34 -07:00
Chris Lu	664ae64646	ci(binaries_dev): serialize concurrent runs to prevent asset name collisions Multiple master pushes within the same minute produced identical BUILD_TIME values, causing concurrent workflow runs to race on identically-named release assets. Upload retries hit 422 already_exists and failed the build. Adding a concurrency group with cancel-in-progress ensures only the latest dev build runs at a time, which is fine since only the latest dev artifacts matter.	2026-04-16 18:09:57 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	018e648d00	build(deps): bump github.com/jackc/pgx/v5 from 5.8.0 to 5.9.0 (#9113 ) Bumps [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) from 5.8.0 to 5.9.0. - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](https://github.com/jackc/pgx/compare/v5.8.0...v5.9.0) --- updated-dependencies: - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-16 16:28:18 -07:00
Lars LehtonenandGitHub	66ce0c29cf	fix(weed/query/engine): check for nil pointers (#9114 )	2026-04-16 16:27:55 -07:00
Chris Lu	108e6b0d5f	Delete scheduled_tasks.lock	2026-04-16 16:16:25 -07:00
Chris LuandGitHub	9d15705c16	fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112 ) * fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C Interrupts fired grace hooks in registration order, so master (started first) shut down before its clients, producing heartbeat-canceled errors and masterClient reconnection noise during weed mini shutdown. Admin/s3/ webdav had no interrupt hooks at all and were killed at os.Exit. - grace: execute interrupt hooks in LIFO (defer-style) order so later- started services tear down first. - filer: consolidate the three separate interrupt hooks (gRPC / HTTP / DB) into one that runs in order, so filer shutdown stays correct independent of FIFO/LIFO semantics. - mini: add MiniClientsShutdownCtx (separate from test-facing MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3, WebDAV and the maintenance worker observe it; runMini registers a cancel hook after startup so under LIFO it fires first and waits up to 10s on a WaitGroup for those services to drain before filer, volume, and master shut down. Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP -> DB) -> volume -> master. * refactor(mini): group mini-client shutdown into one state struct The first pass spread the shutdown plumbing across three globals (MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker). Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini invocation, and chain its ctx from MiniClusterCtx so clients only observe one signal. Tests that cancel MiniClusterCtx still trigger client shutdown via parent-child propagation. - resetMiniClients() installs fresh state at the top of runMini, so in-process test reruns don't inherit stale ctx/wg. - onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown and only observes one ctx. - trackMiniClient() replaces the manual wg.Add/Done dance for the admin goroutine. - miniClientsCtx() gives the admin startup a ctx without re-deriving. - triggerMiniClientsShutdown(timeout) is the interrupt hook body. No behaviour change; existing tests pass. * refactor: generalize shutdown ctx as an option, not a mini-specific helper Several service files (s3, webdav, filer, master, volume) observed the mini-specific MiniClusterCtx or called onMiniClientsShutdown directly. That leaked mini orchestration into code that also runs under weed s3, weed webdav, weed filer, weed master, and weed volume standalone. Replace with a generic `shutdownCtx context.Context` field on each service's Options struct. When non-nil, the server watches it and shuts down gracefully; when nil (standalone), the shutdown path is a no-op. Mini wires the contexts up from a single place (runMini): - miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx = MiniClusterCtx (drives test-triggered teardown) - miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives Ctrl+C teardown before filer/volume/master) All knowledge of MiniClusterCtx now lives in mini.go. * fix(mini): stop worker before clients ctx so admin shutdown isn't blocked Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/ webdav hook sat for 10s then logged "timed out". Admin had started its shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop, waiting for the still-connected worker stream. That in turn left filer clients connected and cascaded into filer's own 10s gRPC graceful-stop timeout. Two causes, both fixed: 1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which makes managerLoop `break out` and exit), then called getTaskLoad() which sends to the same unbuffered cmd channel — no receiver, hangs forever. Reorder Stop() to snapshot the admin client and drain tasks BEFORE sending ActionStop, and call Disconnect() via the local snapshot afterwards. 2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads from c.incoming, which Disconnect closes, yielding a nil response and a panic on response.Message. Handle the closed channel explicitly. 3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs synchronously BEFORE the clients ctx is cancelled. Register worker shutdown there so admin's worker-gRPC GracefulStop finds the worker already disconnected and returns immediately, instead of waiting on a stream that is about to close anyway. Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full process exit in ~11s (the remaining 10s is a pre-existing filer gRPC graceful-stop timeout, not cascaded from the clients tier). * feat(mini): cap filer gRPC graceful stop at 1s under weed mini Full weed mini shutdown was ~11s on a clean exit, dominated by the filer's default 10s gRPC GracefulStop timeout while background SubscribeLocalMetadata streams drained. Expose the timeout as a FilerOptions.gracefulStopTimeout field (default 10s for standalone weed filer) and set it to 1s in mini. Clean weed mini shutdown now takes ~2s.	2026-04-16 16:11:01 -07:00
Chris LuandGitHub	9554e259dd	fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint (#9109 ) * fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint DuckDB ATTACH 's3://<bucket>/' AS cat (TYPE 'ICEBERG', ...) was failing with "schema does not exist" because GET /v1/config ignored the warehouse query parameter and returned no overrides.prefix, so subsequent requests fell through to the hard-coded "warehouse" default bucket instead of the one the client attached. LoadTable also returned an empty config, forcing clients to discover the S3 endpoint out-of-band and producing 403s on direct iceberg_scan calls. - handleConfig now echoes overrides.prefix = bucket and defaults.warehouse when ?warehouse=s3://<bucket>/ is supplied. - getBucketFromPrefix honors a warehouse query parameter as a fallback for clients that skip the /v1/config handshake. - LoadTable responses advertise s3.endpoint and s3.path-style-access so clients can reach data files without separate configuration. Refs #9103 * address review feedback on iceberg S3 endpoint vending - deriveS3AdvertisedEndpoint is now a method on S3Options; honors externalUrl / S3_EXTERNAL_URL, switches to https when -s3.key.file is set, uses the https port when configured, and brackets IPv6 literals via util.JoinHostPort. - handleCreateTable returns s.buildFileIOConfig() in both its staged and final LoadTableResult branches so create and load flows see the same FileIO hints. - Add unit test coverage for the endpoint derivation scenarios. * address CI and review feedback for #9109 - DuckDB integration test now runs under its own newOAuthTestEnv (with a valid IAM config) so the OAuth2 client_credentials flow DuckDB requires actually works; the shared env has no registered credentials, which was the cause of the CI failure. Helper createTableWithToken was added to create tables via Bearer auth. - Tighten TestIssue9103_LoadTableDoesNotVendS3FileIOCredentials to also assert s3.path-style-access = "true", so a partial regression where the endpoint is vended but path-style is dropped still fails. - deriveS3AdvertisedEndpoint now logs a startup hint when it infers the host from os.Hostname because the bind IP is a wildcard, pointing operators at -s3.externalUrl / S3_EXTERNAL_URL for reverse-proxy deployments where the inferred name is not externally reachable. - handleConfig has a comment explaining that any sub-path in the warehouse URL is dropped because catalog routing is bucket-scoped. * fix(iceberg): make advertised S3 endpoint strictly opt-in; add region The wildcard-bind fallback to os.Hostname() in deriveS3AdvertisedEndpoint was hijacking correctly-configured clients: on the CI runner it produced http://runnervmrc6n4:<port>, which Spark (running in Docker) could not resolve, so Spark iceberg tests began failing after the endpoint started being vended in LoadTable responses. Change the rule so advertising is opt-in and never guesses a host that might not be routable: - -s3.externalUrl / S3_EXTERNAL_URL wins (covers reverse-proxy). - Otherwise, only an explicit, non-wildcard -s3.bindIp is used. - Wildcard / empty bind returns "" so no FileIO endpoint is vended and existing clients keep using their own configuration. buildFileIOConfig additionally vends s3.region (defaulting to the same value baked into table bucket ARNs) whenever it vends an endpoint, so DuckDB's attach does not fail with "No region was provided via the vended credentials" when the operator has opted in. The DuckDB issue-9103 integration test runs under an env with a wildcard bind, so it explicitly sets AWS_REGION in the docker run to pick up the same default. The HTTP-level LoadTable-vending test was dropped because its expectation is now conditional and already covered by unit tests in iceberg_issue_9103_test.go.	2026-04-16 15:51:43 -07:00
Chris LuandGitHub	00a2e22478	fix(mount): remove fid pool to stop master over-allocating volumes (#9111 ) * fix(mount): remove fid pool to stop master over-allocating volumes The writeback-cache fid pool pre-allocated file IDs with ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's PickForWrite charges count * expectedDataSize against the volume's effectiveSize, so a full pool refill could charge hundreds of MB against a single volume before any bytes were actually written. That tripped RecordAssign's hard-limit path and eagerly removed volumes from writable, causing the master to grow new volumes even when the real data being written was tiny. Drop the pool entirely. Every chunk upload goes through UploadWithRetry -> AssignVolume with no ExpectedDataSize hint, letting the master fall back to the 1 MB default estimate. The mount->filer grpc connection is already cached in pb.WithGrpcClient (non-streaming mode), so per-chunk AssignVolume is a unary RPC over an existing HTTP/2 stream, not a full dial. Path-based filer.conf storage rules now apply to mount chunk assigns again, which the pool had to skip. Also remove the now-unused operation.UploadWithAssignFunc and its AssignFunc type. * fix(upload): populate ExpectedDataSize from actual chunk bytes UploadWithRetry already buffers the full chunk into `data` before calling AssignVolume, so the real size is known. Previously the assign request went out with ExpectedDataSize=0, making the master fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same over-reservation symptom the pool had, just smaller per call. Stamp ExpectedDataSize = len(data) before the assign RPC when the caller hasn't already set it. This covers mount chunk uploads, filer_copy, filersink, mq/logstore, broker_write, gateway_upload, and nfs — all the UploadWithRetry paths. * fix(assign): pass real ExpectedDataSize at every assign call site After removing the mount fid pool, per-chunk AssignVolume calls went out with ExpectedDataSize=0, making the master fall back to its 1 MB DefaultNeedleSizeEstimate. That's still an over-estimate for small writes. Thread the real payload size through every remaining assign site so RecordAssign charges effectiveSize accurately and stops prematurely marking volumes full. - filer: assignNewFileInfo now takes expectedDataSize and stamps it on both primary and alternate VolumeAssignRequests. Callers pass: - SSE data-to-chunk: len(data) - copy manifest save: len(data) - streamCopyChunk: srcChunk.Size - TUS sub-chunk: bytes read - saveAsChunk (autochunk/manifestize): 0 (small, size unknown until the reader is drained; master uses 1 MB default) - filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize after the adaptive chunkSize is computed. - ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter; upload_chunked.go passes the buffered dataSize at the call site. S3 PUT assignFunc stamps it on the AssignVolumeRequest. - S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize; all seven call sites pass the source chunk's Size. - operation.SubmitFiles / FilePart.Upload: derive per-fid size from FileSize (average for batched requests, real per-chunk size for sequential chunk assigns). - benchmark: pass fileSize. - filer append-to-file: pass len(data). * fix(assign): thread size through SaveDataAsChunkFunctionType The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran AssignVolume before the reader was drained, so it had to pass ExpectedDataSize=0 and fall back to the master's 1 MB default. Add an expectedDataSize parameter to SaveDataAsChunkFunctionType. - mergeIntoManifest already has the serialized manifest bytes, so it passes uint64(len(data)) directly. - Mount's saveDataAsChunk ignores the parameter because it uses UploadWithRetry, which already stamps len(data) on the assign after reading the payload. - webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry path and also ignore the hint. - Filer's saveAsChunk (used for manifestize) plumbs the value to assignNewFileInfo so manifest-chunk assigns get a real size. Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked) pass the chunk size they're about to upload.	2026-04-16 15:51:13 -07:00
Chris LuandGitHub	7916e61c08	fix(mount): avoid self-notify deadlock in Link and CopyFileRange handlers (#9110 ) The Link and CopyFileRange FUSE request handlers were calling fuseServer.InodeNotify (and EntryNotify for copy) synchronously while the kernel was still waiting for the request's reply on the same /dev/fuse fd. Notifications share that fd, so the syscall.Write can block indefinitely when the kernel hasn't drained its queue yet, hanging the entire mount. A goroutine dump from a stuck mount showed the Link handler blocked in syscall.Write inside InodeNotify while the server's read loop kept waiting for new requests. Drop the synchronous notifies. The local meta cache is still updated inline, so subsequent filesystem ops see the fresh state; the kernel's attr/dentry caches re-fetch once their TTL expires.	2026-04-16 14:09:32 -07:00
Chris LuandGitHub	ecc0390795	fix(master): eagerly remove volume from writable when assign hits limit (#9108 ) * fix(master): eagerly remove volume from writable when RecordAssign hits limit Previously, a volume was only removed from the writable list by the heartbeat-driven CollectDeadNodeAndFullVolumes pass, which runs every pulse (5s) after a 5s heartbeat. Under sustained concurrent writes, fio-style workloads observed in the field grew volumes 8-20x past the configured 100MB limit (median 530MB, peak 1.98GB) during that 5-15s detection window. RecordAssign already tracks effective size (reported + pending) on each /dir/assign. It now also removes the volume from writable the moment effectiveSize reaches volumeSizeLimit, and mirrors the activeVolumeCount decrement that Topology.SetVolumeCapacityFull would have done on the next heartbeat. The heartbeat path remains unchanged and idempotent (vl.SetVolumeCapacityFull returns false if already removed, so no double-decrement). Recovery still works: if a heartbeat later reports size < limit and the volume is not oversized, EnsureCorrectWritables adds it back. - weed/topology/volume_layout.go: RecordAssign returns reachedCapacity bool; adds AdjustActiveVolumeCountForFull helper. - weed/topology/topology.go: PickForWrite invokes the decrement on eager full transitions. - TestPickForWrite: pass a 1024-byte hint instead of 0 so the default 1MB pendingDelta does not immediately bust the test's 32KB limit. - New TestRecordAssignReachingCapacityRemovesFromWritable covers the eager removal, active count accounting, and no-double-accounting. * fix(master): recover eagerly-removed volume once decay clears pending After RecordAssign eagerly removes a volume from writables because effectiveSize reached the limit, decay can later bring effectiveSize back under the limit (e.g., when a burst of assigns didn't all result in uploads). Without recovery the volume would stay non-writable until vacuum or a ReadOnly flip. UpdateVolumeSize now re-adds the volume to writables once all of the following hold: * RecordAssign is what removed it (tracked via fullSince timestamp) * at least capacityRecoveryDelay has elapsed since the removal (30s) — this prevents bouncing during a steady stream of assigns near the limit * effectiveSize has decayed below the crowded threshold (90% of limit) * reportedSize is under the limit (actual disk is not over) * standard EnsureCorrectWritables preconditions: enough copies, all copies writable, not oversized The caller (SyncDataNodeRegistration) re-increments activeVolumeCount symmetrically with the decrement done on eager removal. * review: release VolumeLayout lock before UpAdjustDiskUsageDelta adjustActiveVolumeCount held vl.accessLock across the tree-climbing UpAdjustDiskUsageDelta walk. That walk takes per-level DiskUsages locks and could be re-entered from other call paths that hold a node-level lock and then acquire vl.accessLock. Copy the node list under the VolumeLayout lock and release it before the tree walk to eliminate the lock-ordering hazard.	2026-04-16 12:50:30 -07:00
Chris LuandGitHub	40ffc73aa8	ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106 ) * ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes Replace the local buildx cache + manual fallback with docker/setup-buildx-action and docker/build-push-action using type=gha cache. The e2e and pjdfstest Dockerfile layers now persist across runs in GitHub's own cache backend, so apt-get update only hits Ubuntu mirrors when the Dockerfiles change. Also add Acquire::Retries and Timeout so first-run cache-miss builds survive transient mirror sync errors. * ci(pjdfstest): use local registry to share e2e image across buildx builds The docker-container buildx driver cannot see images loaded into the host Docker daemon, so the second build's FROM chrislusf/seaweedfs:e2e failed with "not found" on registry-1.docker.io. Run a local registry:2 on the runner, push both images to localhost:5000, remap the FROM via build-contexts so the Dockerfile stays unchanged, then tag the pulled images locally for docker compose to consume.	2026-04-16 12:50:19 -07:00
Chris LuandGitHub	ff4f96c71f	fix(filer): drop stale master gRPC cache on stream death (#9102 ) (#9107 ) * fix(filer): drop stale master gRPC cache on stream death (#9102) When the master server restarts behind a stable L4 endpoint (e.g. a Kubernetes ClusterIP Service), the filer's streaming KeepConnected channel detects the disconnect and reconnects, but the shared request-path ClientConn cached in pb.grpcClients can remain in READY state while actually being dead. New AssignVolume/LookupVolume calls reuse that cached channel and return `rpc error: code = Canceled desc = context canceled` for every request, until the filer pod is restarted. - Expose pb.InvalidateGrpcConnection(address) to drop a cached ClientConn when a higher-level signal says it is stale. - In MasterClient.tryConnectToMaster, invalidate the cached request-path channel whenever the KeepConnected stream returns, so unrelated callers dial fresh on their next RPC. - Extend operation.Assign's retry predicate to cover Canceled and DeadlineExceeded while the caller context is still live: the first failure invalidates the stale ClientConn via shouldInvalidateConnection, and the retry dials a new channel. * fix(grpc): invalidate cached peer conn on streaming death in other paths Extends the master-client fix to the other streaming-caller + cached non-streaming-peer pairs that share the same stale-channel failure mode when the peer restarts behind a stable L4 endpoint (k8s Service VIP, external load balancer): - pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote gateway, etc. → filer): invalidate the filer's cached ClientConn when the SubscribeMetadata stream returns an error. - filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer): invalidate the peer's cached ClientConn after doSubscribeToOneFiler fails, so the next iteration's readFilerStoreSignature / updateOffset calls dial fresh. - mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator (subscriber → broker): invalidate the broker's cached ClientConn when the SubscribeMessage / SubscriberToSubCoordinator stream errors. - mq broker.BrokerConnectToBalancer (broker → broker-balancer): invalidate the balancer's cached ClientConn after the PublisherToPubBalancer stream errors. * address review feedback on InvalidateGrpcConnection - pb.InvalidateGrpcConnection: drop the cache entry under grpcClientsLock but call ClientConn.Close() after releasing the lock, so Close's internal synchronisation/IO doesn't serialise unrelated callers on the global map lock. - wdclient.tryConnectToMaster: only invalidate the cached request-path channel when the streaming call returned an error. On a healthy leader redirect (gprcErr == nil) the cached channel is still usable and invalidating it just causes a needless re-dial from concurrent callers. * refactor(grpc): centralize peer-conn invalidation in streaming path Previously every streaming caller duplicated the same invalidate-cached- non-streaming-peer-conn wrapper around their WithGrpcClient(true, ...) call. Move that logic into WithGrpcClient itself: when the streaming fn returns an error, invalidate any cached ClientConn for the same address. This removes six near-identical call-site wrappers and gives every current and future streaming caller the fix by default. Also aligns the non-streaming branch with the new Invalidate helper's lock discipline: delete the cache entry under grpcClientsLock, then Close the ClientConn after releasing the lock.	2026-04-16 12:10:25 -07:00
Chris LuandGitHub	9896eade51	feat(mount): set FOPEN_KEEP_CACHE on re-open of unchanged files (#9097 ) * feat(mount): set FOPEN_KEEP_CACHE when file mtime is unchanged On re-open of an unmodified file, signal the kernel to preserve its existing page cache. This eliminates redundant volume server reads for workloads that repeatedly open-read-close the same files (build systems, config readers, etc.). * fix(mount): use guarded type assertion for openMtimeCache load Use the two-value form of type assertion when loading from sync.Map to prevent potential panics if a non-int64 value is ever stored. * fix(mount): skip redundant mtime store and invalidate on truncation - Avoid redundant sync.Map Store when cached mtime already matches the current mtime, reducing contention on the hot open path. - Invalidate openMtimeCache in SetAttr when file size changes (truncation), preventing stale kernel page cache after ftruncate. * fix(mount): use nanosecond mtime precision and bounded cache for FOPEN_KEEP_CACHE - Compare both Mtime (seconds) and MtimeNs (nanoseconds) to detect sub-second modifications common in automated workloads. - Replace unbounded sync.Map with a bounded map + mutex (8192 entries, random eviction when full), following the existing atimeMap pattern. - Extract applyKeepCacheFlag and invalidateOpenMtimeCache methods for clarity and testability. - Add tests for nanosecond precision and cache eviction. * fix(mount): invalidate mtime cache in truncateEntry for O_TRUNC consistency Add invalidateOpenMtimeCache call to truncateEntry so the Create path with O_TRUNC follows the same explicit invalidation pattern as SetAttr and Write.	2026-04-16 11:37:52 -07:00

1 2 3 4 5 ...