Mirror of https://github.com/seaweedfs/seaweedfs.git, synced 2026-05-17 15:21:31 +00:00
Branch 4.21, 13535 commits

3ff92f797d  4.21

a8ba9d106e  peer chunk sharing 7/8: tryPeerRead read-path hook (#9136)

* mount: batched announcer + pooled peer conns for mount-to-mount RPCs

* peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded.

* peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced.

* WFS starts both alongside the gRPC server; stops them on unmount.

* mount: wire tryPeerRead via FetchChunk streaming gRPC

Replaces the HTTP GET byte-transfer path with a gRPC server-stream FetchChunk call. Same fall-through semantics: any failure drops through to entryChunkGroup.ReadDataAt, so reads never slow below status quo.

* peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk (flattening manifests), asks the HRW owner for holders via ChunkLookup, then opens FetchChunk on each holder in LRU order (PR #5) until one succeeds. Assembled bytes are verified against FileChunk.ETag end-to-end — the peer is still treated as untrusted. Reuses the shared PeerConnPool from PR #6 for all outbound gRPC.

* peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing itself on a self-owned fid.

* filehandle_read.go: tryPeerRead slot between tryRDMARead and entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the presence of peerGrpcServer (the single identity test). Read ordering with the feature enabled is now: local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server. One port, one identity, one connection pool — no more HTTP bytecast.

* test(fuse_p2p): end-to-end CI test for peer chunk sharing

Adds a FUSE-backed integration test that proves mount B can satisfy a read from mount A's chunk cache instead of the volume tier. Layout (modelled on test/fuse_dlm):
  - test/fuse_p2p/framework_test.go — cluster harness (1 master, 1 volume, 1 filer, N mounts, all with -peer.enable)
  - test/fuse_p2p/peer_chunk_sharing_test.go — writer-reader scenario

The test (TestPeerChunkSharing_ReadersPullFromPeerCache):
1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner of a chunk is self ~50 % of the time (peer path short-circuits); with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly exercises the remote-owner fan-out.
2. Mount 0 writes a ~8 MiB file, then reads it back through its own FUSE to warm its chunk cache.
3. Waits for seed convergence (one full MountList refresh) plus an announcer flush cycle, so chunk-holder entries have reached each HRW owner.
4. Mount 1 reads the same file.
5. Verifies byte-for-byte equality AND greps mount 1's log for "peer read successful" — content matching alone is not proof (the volume fallback would also succeed), so the log marker is what distinguishes p2p from fallback.

Workflow .github/workflows/fuse-p2p-integration.yml triggers on any change to mount/filer peer code, the p2p protos, or the test itself. Failure artifacts (server + mount logs) are uploaded for 3 days. Mounts run with -v=4 so the tryPeerRead success/failure glog messages land in the log file the test greps.
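A minimal Go sketch of the fall-through read ordering described above. Only the hook names tryRDMARead, tryPeerRead, and entryChunkGroup.ReadDataAt come from the commit; everything else here is invented for illustration, and each tier may fail without affecting correctness, only latency.

```go
// Illustrative only — not the SeaweedFS source. Each tier either
// serves the read or falls through to the next one.
package main

import "fmt"

type readTier func(buf []byte, off int64) (n int, ok bool)

// readAt tries each tier in order; a failure at any tier never
// fails the read, it just drops through to the next tier.
func readAt(tiers []readTier, buf []byte, off int64) int {
	for _, t := range tiers {
		if n, ok := t(buf, off); ok {
			return n
		}
	}
	return 0
}

func main() {
	tiers := []readTier{
		func(b []byte, o int64) (int, bool) { return 0, false },     // local cache miss
		func(b []byte, o int64) (int, bool) { return 0, false },     // RDMA sidecar unavailable
		func(b []byte, o int64) (int, bool) { return 0, false },     // no peer holds the chunk
		func(b []byte, o int64) (int, bool) { return len(b), true }, // volume server always answers
	}
	fmt.Println(readAt(tiers, make([]byte, 4096), 0)) // 4096
}
```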

73f10fa528  peer chunk sharing 6/8: announce queue + batched flush (#9135)

mount: batched announcer + pooled peer conns for mount-to-mount RPCs

* peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded.

* peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced.

* WFS starts both alongside the gRPC server; stops them on unmount.
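A rough sketch, not the peer_announcer.go code, of the non-blocking-enqueue-plus-ticker-flush shape described above; the owner lookup, channel capacity, and flush interval are assumptions.

```go
// Sketch: enqueue never blocks; the ticker drains the queue,
// groups fids by owner, and fans out one announce per owner.
package main

import (
	"fmt"
	"time"
)

type announcer struct {
	ch    chan string         // pending fids; full channel => drop
	owner func(string) string // HRW owner lookup (assumed)
}

func (a *announcer) EnqueueAnnounce(fid string) {
	select {
	case a.ch <- fid:
	default: // queue full: drop; a later read re-announces anyway
	}
}

func (a *announcer) run(interval time.Duration) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for range tick.C {
		batches := map[string][]string{}
	drain:
		for {
			select {
			case fid := <-a.ch:
				batches[a.owner(fid)] = append(batches[a.owner(fid)], fid)
			default:
				break drain
			}
		}
		for owner, fids := range batches {
			go fmt.Printf("ChunkAnnounce -> %s: %v\n", owner, fids) // one RPC per owner, in parallel
		}
	}
}

func main() {
	a := &announcer{ch: make(chan string, 1024), owner: func(fid string) string { return "peer-" + fid[:1] }}
	go a.run(100 * time.Millisecond)
	for _, fid := range []string{"3,01", "3,02", "7,0a"} {
		a.EnqueueAnnounce(fid)
	}
	time.Sleep(300 * time.Millisecond)
}
```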

fe9ca35bbd  peer chunk sharing 5/8: mount chunk-directory shard (#9134)

mount: tier-2 chunk directory + FetchChunk streaming on one gRPC port

Collapses the old two-port design (HTTP peer-serve + separate gRPC directory) into a single gRPC service that handles every mount-to-mount exchange: ChunkAnnounce, ChunkLookup, and the new FetchChunk byte stream.

* peer_directory.go: fid -> holders shard, HRW-gated; returns holders in LRU order; capacity-bounded; Sweep handles eviction under write-lock while Lookup runs under RLock (hot path is concurrent).

* peer_grpc.go: single MountPeer gRPC server implementing all three RPCs. FetchChunk frames bytes at 1 MiB per Send so the default 4 MiB message cap does not constrain chunk size; cache miss returns gRPC NOT_FOUND so clients distinguish miss from transport error. Reuses pb.NewGrpcServer for consistent keepalive + msg-size tuning.

* peer_bytepool.go: sync.Pool wrapper around *[]byte that the server uses to avoid a fresh 8 MiB allocation per FetchChunk call.

* WFS wiring starts the gRPC server on option.PeerListen (the single peer port) using the advertise address resolved in PR #3 as the HRW identity. A background sweeper evicts expired directory entries every 60 s.
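To make the 1 MiB-per-Send framing concrete, a hedged sketch: the sender interface is a stand-in for the generated FetchChunk server stream, which is not reproduced here.

```go
// Sketch of the framing idea: split a chunk into 1 MiB frames so no
// single gRPC message approaches the default 4 MiB cap.
package main

import "fmt"

const frameSize = 1 << 20 // 1 MiB per Send

type sender interface{ Send(frame []byte) error }

func streamChunk(s sender, data []byte) error {
	for off := 0; off < len(data); off += frameSize {
		end := off + frameSize
		if end > len(data) {
			end = len(data)
		}
		if err := s.Send(data[off:end]); err != nil {
			return err
		}
	}
	return nil
}

type countingSender struct{ frames int }

func (c *countingSender) Send(frame []byte) error { c.frames++; return nil }

func main() {
	c := &countingSender{}
	_ = streamChunk(c, make([]byte, 8<<20)) // an 8 MiB chunk
	fmt.Println(c.frames)                   // 8 frames, each under the message cap
}
```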

8a6348d3e9  peer chunk sharing 4/8: mount registrar + HRW owner selection (#9133)

* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly
  Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable -> -mount.p2p (weed filer)
  -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).

* mount: add peer chunk sharing options + advertise address resolver

First cut at the peer chunk sharing wiring on the mount side. No functional behavior yet — this PR just introduces the option fields, the -peer.* flags, and the helper that resolves a reachable host:port from them. The server implementation arrives in PR #5 (gRPC service) and the fetcher in PR #7.

* ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we use -peer.listen's bind host if specific; else util.DetectedHostAddress combined with the port. This is what gets registered with the filer and announced to peers, so wildcard binds no longer result in unreachable identities like "[::]:18080".

* Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack. One port handles both directory RPCs and streaming chunk fetches (see PR #1 FetchChunk proto), so there is no second -peer.grpc.* flag — the old HTTP byte-transfer path is gone.

* New flags on weed mount: -peer.enable, -peer.listen (default :18080), -peer.advertise (default auto), -peer.rack.

* mount: register with filer and maintain HRW seed view

Adds the mount-side tier-1 client. On startup the mount calls MountRegister with its advertise address (PR #3) and keeps both the filer entry and the local seed view fresh via background tickers (30 s register / 30 s list, 90 s filer TTL).

* peer_hrw.go: pure rendezvous-hashing helper picking a single owner per fid via top-1 HRW. Adding or removing one seed moves only ~1/N fids.

* peer_registrar.go: heartbeat + list poller. Seeds() returns the slice directly (no per-call copy) since listOnce atomically swaps; background RPCs bind their context to Stop() so unmount doesn't hang on a slow filer.

* WFS wiring uses ResolvePeerAdvertiseAddr from PR #3 for the identity registered with the filer. No HTTP server, no second port — one reachable address represents the mount.

* mount: broadcast MountRegister/MountList to every filer

Previously the registrar called through wfs.WithFilerClient, which only reaches whichever filer the WFS filer-client session happens to be on. That meant two mounts pointing at different filers would never see each other: the filer mount registries are in-memory and per-filer (no filer-to-filer sync), so each mount's MountList only returned peers that had also registered through the same filer.

This commit makes the registrar multi-filer aware:

* NewPeerRegistrar now takes the full FilerAddresses slice and a per-filer dial function. The old single-filer peerFilerClient interface is gone.

* registerOnce fans a MountRegister RPC out to every filer in parallel. Succeeds if at least one filer accepted — an unreachable filer is tolerated, logged, and retried on the next heartbeat.

* listOnce polls every filer's MountList in parallel and merges the responses by peer_addr, keeping the newest LastSeenNs on duplicates. Mounts talking to different filers therefore converge once every filer has been polled once.

The merged-list property is what lets a fleet of mounts spread across multiple filers still form a single HRW seed view. Each filer only ever sees the subset of mounts that heartbeat through it, but the registrar reconstructs the union client-side.

New unit tests guard both properties:
  - RegisterBroadcastsToAllFilers: one registerOnce hits all N filers.
  - ListMergesAcrossFilers: mount-a on filer-1 and mount-b on filer-2 both appear in the merged seed set.
  - ListMergeKeepsNewestLastSeen: the same mount reported by two filers collapses to one entry with the freshest timestamp.
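For readers unfamiliar with rendezvous hashing, a small self-contained sketch of the top-1 HRW selection peer_hrw.go is described as doing. The FNV-1a hash is an assumption for illustration, not necessarily what the helper uses.

```go
// Top-1 HRW: score every seed against the fid, pick the max score.
package main

import (
	"fmt"
	"hash/fnv"
)

func hrwOwner(fid string, seeds []string) string {
	var best string
	var bestScore uint64
	for _, s := range seeds {
		h := fnv.New64a() // hash choice is an assumption
		h.Write([]byte(s))
		h.Write([]byte{0}) // separator between seed and fid
		h.Write([]byte(fid))
		if score := h.Sum64(); best == "" || score > bestScore {
			best, bestScore = s, score
		}
	}
	return best
}

func main() {
	seeds := []string{"10.0.0.1:18080", "10.0.0.2:18080", "10.0.0.3:18080"}
	fmt.Println(hrwOwner("3,01637037d6", seeds))
	// Removing one seed re-homes only the fids it owned (~1/N of them);
	// every other fid keeps its owner — the stability property the
	// commit message relies on.
}
```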

af1e571297  peer chunk sharing 3/8: mount peer-serve HTTP endpoint (#9132)

* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly
  Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable -> -mount.p2p (weed filer)
  -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).

* mount: add peer chunk sharing options + advertise address resolver

First cut at the peer chunk sharing wiring on the mount side. No functional behavior yet — this PR just introduces the option fields, the -peer.* flags, and the helper that resolves a reachable host:port from them. The server implementation arrives in PR #5 (gRPC service) and the fetcher in PR #7.

* ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we use -peer.listen's bind host if specific; else util.DetectedHostAddress combined with the port. This is what gets registered with the filer and announced to peers, so wildcard binds no longer result in unreachable identities like "[::]:18080".

* Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack. One port handles both directory RPCs and streaming chunk fetches (see PR #1 FetchChunk proto), so there is no second -peer.grpc.* flag — the old HTTP byte-transfer path is gone.

* New flags on weed mount: -peer.enable, -peer.listen (default :18080), -peer.advertise (default auto), -peer.rack.
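A hedged sketch of the precedence ResolvePeerAdvertiseAddr is said to apply (explicit advertise, else a specific bind host, else a detected host plus the listen port). The detectHost parameter stands in for util.DetectedHostAddress; the function body is illustrative, not the actual helper.

```go
// Sketch: pick the advertise identity so wildcard binds never
// produce unreachable addresses like "[::]:18080".
package main

import (
	"fmt"
	"net"
)

func resolveAdvertise(advertise, listen string, detectHost func() string) string {
	if advertise != "" {
		return advertise // explicit -peer.advertise always wins
	}
	// err handling is simplified; a malformed listen address would
	// need real validation in production code.
	host, port, err := net.SplitHostPort(listen)
	if err == nil && host != "" && host != "0.0.0.0" && host != "::" {
		return net.JoinHostPort(host, port) // bind host is specific: reuse it
	}
	return net.JoinHostPort(detectHost(), port) // wildcard bind: detected host + port
}

func main() {
	detect := func() string { return "10.0.0.7" }
	fmt.Println(resolveAdvertise("", ":18080", detect))                  // 10.0.0.7:18080
	fmt.Println(resolveAdvertise("", "10.0.0.5:18080", detect))          // 10.0.0.5:18080
	fmt.Println(resolveAdvertise("peer.example:1234", ":18080", detect)) // explicit wins
}
```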

e24a443b17  peer chunk sharing 2/8: filer mount registry (#9131)

* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly
  Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable -> -mount.p2p (weed filer)
  -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).
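The List-under-RLock / Sweep-under-write-lock contract, reduced to a sketch with illustrative field names (not the registry's actual types):

```go
// Sketch: List filters expired entries under RLock and never
// mutates; Sweep is the sole reclamation path.
package main

import (
	"fmt"
	"sync"
	"time"
)

type peerEntry struct {
	Addr     string
	LastSeen time.Time
	TTL      time.Duration
}

type registry struct {
	mu    sync.RWMutex
	peers map[string]peerEntry
}

func (r *registry) List(now time.Time) []peerEntry {
	r.mu.RLock() // readers don't serialize on each other
	defer r.mu.RUnlock()
	var live []peerEntry
	for _, p := range r.peers {
		if now.Sub(p.LastSeen) <= p.TTL {
			live = append(live, p) // expired entries are filtered, not deleted
		}
	}
	return live
}

func (r *registry) Sweep(now time.Time) {
	r.mu.Lock() // only Sweep mutates the map
	defer r.mu.Unlock()
	for addr, p := range r.peers {
		if now.Sub(p.LastSeen) > p.TTL {
			delete(r.peers, addr)
		}
	}
}

func main() {
	r := &registry{peers: map[string]peerEntry{
		"m1:18080": {"m1:18080", time.Now(), 90 * time.Second},
		"m2:18080": {"m2:18080", time.Now().Add(-3 * time.Minute), 90 * time.Second},
	}}
	fmt.Println(len(r.List(time.Now()))) // 1: m2 is expired but stays until Sweep
	r.Sweep(time.Now())
	fmt.Println(len(r.peers)) // 1
}
```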

d7d834b8f9  peer chunk sharing 1/8: proto definitions (#9130)

proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design.

6787a4b4e8  fix kafka gateway and consumer group e2e flakes (#9129)

* fix(test): reduce kafka gateway and consumer group flakes

* fix(kafka): make broker health-check backoff respect context

Replace time.Sleep in the retry loop with a select on bc.ctx.Done() and time.After so the backoff is interruptible during shutdown, per review feedback on PR #9129.

* fix(kafka): guard broker HealthCheck against nil client

Return the same "broker client not connected" error used by the other exported BrokerClient methods instead of panicking on a partially initialized client, per CodeRabbit review feedback on PR #9129.
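The interruptible backoff in the second commit is a common Go pattern; a sketch under the assumption of a simple doubling backoff with a cap (the real loop's bounds may differ):

```go
// Sketch: replace time.Sleep with a select so a shutdown context
// cancels the wait instead of blocking on a sleeping retry loop.
package main

import (
	"context"
	"fmt"
	"time"
)

func retryWithBackoff(ctx context.Context, attempt func() error) error {
	backoff := 100 * time.Millisecond
	for {
		if err := attempt(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // shutdown interrupts the backoff immediately
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > 5*time.Second {
			backoff = 5 * time.Second // cap the doubling
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	err := retryWithBackoff(ctx, func() error { return fmt.Errorf("broker not ready") })
	fmt.Println(err) // context deadline exceeded
}
```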

6832b9945b  ci(s3tests): install libxml2/libxslt dev headers before pip install

ceph/s3-tests pins lxml without an upper bound. When pip picks a release whose prebuilt wheel isn't published for Python 3.9 on the runner, it falls back to sdist and fails without libxml2-dev / libxslt1-dev.

1c130c2d47  fix(mount): close inodeLocks cleanup race that let two flock holders coexist (#9128)

* fix(mount): close inodeLocks cleanup race that allowed two flock holders
PosixLockTable.getOrCreateInodeLocks released plt.mu before the caller
acquired il.mu. A concurrent maybeCleanupInode could delete the map
entry in that window; the first caller would then insert its lock into
the orphaned inodeLocks while a later caller created a fresh entry in
the map, so findConflict never observed the orphaned lock and two
owners could simultaneously believe they held the same exclusive flock.
This matches the flaky CI failure seen in
TestPosixFileLocking/ConcurrentLockContention:
Error: Should be empty, but was [worker N: flock overlap detected with 2 holders]
Mark removed inodeLocks as dead under plt.mu+il.mu, and have SetLk /
SetLkw recheck the flag after locking il.mu, refetching the live entry
from the map when orphaned. Also delete the map entry only if it still
points to this il, so a racing recreate is not clobbered.
Adds TestConcurrentFlockChurnPreservesMutualExclusion: 16 goroutines x
500 flock/unflock iterations on one inode. Reliably reports 500+
overlaps per run before the fix; clean across 100 race-enabled runs
after.
* fix(mount): extend dead-flag contract to GetLk and self-heal primitives
Address review feedback on the initial cleanup-race fix:
1. GetLk had the same stale-pointer bug as SetLk. A caller could grab
an inodeLocks pointer, have cleanup orphan it and a replacement il
receive a conflicting lock, then answer F_UNLCK off the empty dead
pointer. Add the same dead recheck + refetch loop.
2. getOrCreateInodeLocks and getInodeLocks now treat a dead map entry
as defective: the former replaces it with a fresh inodeLocks, the
latter drops it and returns nil. Production cannot reach that state
(maybeCleanupInode atomically deletes under plt.mu when it sets
dead), but the hardening guarantees the SetLk / SetLkw / GetLk
retry loops always make progress even if a future refactor reorders
those operations, and it lets the white-box tests set up a stale
dead entry without spinning.
3. Strengthen the regression suite:
- TestSetLkRetriesPastDeadInodeLocks: deterministic white-box test
that installs a dead il in the map and asserts SetLk routes the
new lock into a fresh il (not the orphan), that GetLk reports the
resulting conflict, and that a different-owner acquire is rejected
with EAGAIN.
- TestGetInodeLocksEvictsDeadEntry: verifies both map-read primitives
drop or replace dead entries.
- TestConcurrentFlockChurnPreservesMutualExclusion: replace the
timing-fragile Add(1)-and-check counter with a Swap+CAS detector.
Each worker claims a slot after SetLk OK and releases it before
UN, flagging both an observed predecessor and a lost CAS on
release. Against a reverted fix the detector fires 1000+ times per
run; with the fix clean across 100 race-enabled iterations.
* test(mount): fail fast on unexpected SetLk statuses in churn loop
The stress test blindly spun on any non-OK SetLk status and discarded
the unlock return. If SetLk ever returns something other than OK or
EAGAIN (e.g. after a future refactor introduces a new error), the
acquire loop would spin forever and an unlock failure would be
silently swallowed.
Capture the acquire status, retry only on the expected EAGAIN, and
assert unlock returns OK. Use t.Errorf + return (not t.Fatalf) because
the checks run on worker goroutines where FailNow is unsafe. The
Swap+CAS overlap detector is unchanged.
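An illustrative reduction of the dead-flag contract described above, with simplified names rather than the mount package's actual types: acquire the per-inode mutex, recheck the dead flag, and refetch the live entry when cleanup won the race.

```go
// Sketch only: the retry loop that keeps SetLk from stranding a
// lock inside an orphaned inodeLocks.
package main

import (
	"fmt"
	"sync"
)

type inodeLocks struct {
	mu     sync.Mutex
	dead   bool // set when cleanup removes the entry from the table
	holder string
}

type lockTable struct {
	mu sync.Mutex
	m  map[uint64]*inodeLocks
}

func (t *lockTable) getOrCreate(ino uint64) *inodeLocks {
	t.mu.Lock()
	defer t.mu.Unlock()
	il := t.m[ino]
	if il == nil || il.dead { // a dead map entry is treated as defective
		il = &inodeLocks{}
		t.m[ino] = il
	}
	return il
}

// setLk retries until it holds a live entry, so a concurrent cleanup
// in the lookup-to-lock window can no longer orphan the new lock.
func (t *lockTable) setLk(ino uint64, owner string) bool {
	for {
		il := t.getOrCreate(ino)
		il.mu.Lock()
		if il.dead {
			il.mu.Unlock()
			continue // cleanup won the race: refetch the live entry
		}
		ok := il.holder == "" || il.holder == owner
		if ok {
			il.holder = owner
		}
		il.mu.Unlock()
		return ok
	}
}

func main() {
	t := &lockTable{m: map[uint64]*inodeLocks{}}
	fmt.Println(t.setLk(1, "a")) // true
	fmt.Println(t.setLk(1, "b")) // false: mutual exclusion holds
}
```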

c1ccbe97dd  feat(filer.backup): -initialSnapshot seeds destination from live tree (#9126)

* feat(filer.backup): -initialSnapshot seeds destination from live tree

Replaying the metadata event log on a fresh sync only leaves files that still exist on the source at replay time: any entry that was created and later deleted is replayed as a create/delete pair and never materializes on the destination. Users who wipe the destination and re-run filer.backup therefore see "only new files" instead of a full backup, even when -timeAgo=876000h is passed and the subscription genuinely starts from epoch (ref discussion #8672).

Add a -initialSnapshot opt-in flag: when set on a fresh sync (no prior checkpoint, -timeAgo unset), walk the live filer tree under -filerPath via TraverseBfs and seed the destination through sink.CreateEntry, then persist the walk-start timestamp as the checkpoint and subscribe from there. Capturing the timestamp before the walk lets the subscription catch any create/update/delete racing with the walk — sink CreateEntry is idempotent across the builtin sinks so replay is safe.

Honors existing -filerExcludePaths / -filerExcludeFileNames / -filerExcludePathPatterns filters and skips /topics/.system/log the same way the subscription path does. Also log "starting from <t> (no prior checkpoint)" instead of a misleading "resuming from 1970-01-01" when the KV has no stored offset.

* fix(filer.backup): guard initialSnapshot counters under TraverseBfs workers

TraverseBfs fans the callback out across 5 worker goroutines, so the entryCount / byteCount updates and the 5-second progress-log gate in runInitialSnapshot were racing. Switch the counters to atomic.Int64 and protect the lastLog check/update with a short-scoped mutex so the heavy sink.CreateEntry call stays outside the critical section. Flagged by gemini-code-assist on #9126; verified with go test -race.

* fix(filer.backup): harden initialSnapshot against transient errors and path edge cases

Three review items from CodeRabbit on #9126:

1. getOffset errors no longer leave isFreshSync=true. Before, a transient KV read failure would cause runFilerBackup's retry loop to redo the full -initialSnapshot walk on every retry. Treat any offset-read error as "not fresh" so the snapshot only runs when we've verified there really is no prior checkpoint.
2. initialSnapshotTargetKey now normalizes sourcePath to a trailing-slash base before stripping the prefix, so edge cases where sourceKey equals sourcePath (trailing-slash mismatch or root-entry emission) no longer index past the end. Unit tests cover both forms.
3. Documented the TraverseBfs-enumerates-excluded-subtrees performance characteristic on runInitialSnapshot, since pruning requires a separate change to TraverseBfs itself.

* fix(filer.backup): retry setOffset after initialSnapshot to avoid full re-walks

If the snapshot walk finishes but the subsequent setOffset fails, the retry loop in runFilerBackup will re-enter doFilerBackup with an empty checkpoint and run the full BFS again — on a multi-million-entry tree that's hours of wasted work over a 100-byte KV write. Retry the write a handful of times with exponential backoff before giving up, and log loudly at the final failure (with snapshotTsNs + sinkId) so operators recognize the symptom instead of guessing at mysterious repeated walks. Nitpick raised by CodeRabbit on #9126.

* fix(filer.backup): initialSnapshot ignore404, skew margin, exclude dir-entry itself

Three review items from CodeRabbit on #9126:

1. ignore404Error now threads into runInitialSnapshot. If a file is listed by TraverseBfs and then deleted before CreateEntry reads its chunks, the follow path already ignores 404s — the snapshot path was aborting and triggering a full re-walk. Treat an ignorable 404 as "skip this entry, continue."
2. snapshotTsNs now uses `time.Now() - 1min` instead of `time.Now()`. Metadata events are stamped server-side, so a fast backup-host clock could skip events that fire during or right after the walk. Matches the 1-minute margin meta_aggregator.go applies on initial peer traversal; duplicate replay is harmless because CreateEntry is idempotent.
3. Exclude checks now run against the entry's own full path, not just its parent. A walked directory whose full path matches SystemLogDir or -filerExcludePaths was being seeded to the destination; only its descendants were being skipped. Verified with a manual repro where -filerExcludePaths=/data/skipdir now keeps the skipdir entry itself off the destination.

* refactor(filer): share destKey helper between buildKey and initialSnapshot

Extract destKey(dataSink, targetPath, sourcePath, sourceKey, mTime) from buildKey in filer_sync.go. Both the event-log path (buildKey) and the initialSnapshot walk (initialSnapshotTargetKey) now go through the same helper, so a walk-seeded file and an event-replayed file always resolve to the same destination key. As a bonus, buildKey picks up the defensive trailing-slash normalization that initialSnapshotTargetKey introduced — no more index-past-end risk when sourceKey happens to equal sourcePath. Also tightens the mTime lookup to guard against nil Attributes (caught by an existing test against buildKey when I first moved the lookup out of the incremental branch).
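A compressed sketch of the -initialSnapshot ordering (capture the timestamp before walking, seed, then subscribe from the captured timestamp). The walk and subscribe parameters are hypothetical stand-ins, and the one-minute margin mirrors the skew fix above.

```go
// Sketch: because the timestamp is captured BEFORE the walk, any
// create/update/delete racing the walk is replayed by the
// subscription; duplicates are safe since CreateEntry is idempotent.
package main

import (
	"fmt"
	"time"
)

func initialSnapshot(walk func(seed func(path string)), subscribeFrom func(ts time.Time)) {
	ts := time.Now().Add(-time.Minute) // server-side clocks may run behind this host
	walk(func(path string) { fmt.Println("seed", path) })
	subscribeFrom(ts) // checkpoint persisted as the walk-start timestamp
}

func main() {
	initialSnapshot(
		func(seed func(string)) { seed("/data/a"); seed("/data/b") },
		func(ts time.Time) { fmt.Println("subscribe from", ts.Format(time.RFC3339)) },
	)
}
```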

d57fc67022  fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127)

* fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files

Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was true, printing "Change volume id for large file is not implemented yet" and continuing. Because the BFS traversal only looks at top-level entry.Chunks, sub-chunks referenced inside a manifest were never considered either. For any file stored as a chunk manifest (large files go this path), chunks in the source volume stayed put, leaving behind a few MB of live data that vacuum and volume.deleteEmpty couldn't clean up.

This change resolves each manifest chunk recursively, moves any sub-chunk whose volume id is in the merge plan via the existing moveChunk path, and re-serializes the manifest. If the manifest chunk itself lives in a source volume, or any sub-chunk moved, the new manifest blob is uploaded to a freshly assigned file id (the old needle becomes orphaned and is reclaimed by vacuum like any other moved chunk). Fixes #9116.

* address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes

- Call UpdateEntry once per entry after the chunk loop instead of once per moved chunk (gemini nit).
- In dry-run mode, mark anySubChanged when a sub-chunk in the plan is encountered and return changed=true after printing "rewrite manifest", so nested manifests also surface their would-rewrites (gemini nit).
- Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is restored even when proto.Marshal fails (coderabbit nit).
- Reject AssignVolume results whose file id lands on a volume that is a source in the merge plan, and retry — otherwise the replacement manifest could be written to the volume being emptied (coderabbit).

96af27a131  feat(shell): add fs.distributeChunks command for even chunk distribution (#9117)

* feat(shell): add fs.distributeChunks command for even chunk distribution
Add a new weed shell command that redistributes a file's chunks evenly
across volume server nodes.
Supports three distribution modes via -mode flag:
- primary: balance chunk ownership across nodes (default)
- replica: balance both ownership and replica copies
- round-robin: assign chunks by offset order for sequential read
optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...; see the
sketch after the usage examples)
Additional options:
- -nodes=N to target specific number of nodes
- -apply to execute (dry-run by default)
Usage:
fs.distributeChunks -path=/buckets/file.dat
fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply
fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply
fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply
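A sketch of the round-robin assignment referenced in the mode list above, keyed by the chunk's sequential position rather than its original index (the bug the dead-loop fix below corrects). Plan keys and node names are illustrative, not the command's actual types.

```go
// Sketch: offsets sorted ascending, target = nodes[pos % totalNodes].
package main

import "fmt"

func planRoundRobin(chunkOffsets []int64, nodes []string) map[int64]string {
	plan := make(map[int64]string, len(chunkOffsets))
	for pos, off := range chunkOffsets { // pos is the sequential position
		plan[off] = nodes[pos%len(nodes)]
	}
	return plan
}

func main() {
	offsets := []int64{0, 8 << 20, 16 << 20, 24 << 20}
	plan := planRoundRobin(offsets, []string{"A", "B", "C"})
	for _, off := range offsets {
		fmt.Printf("chunk@%d -> %s\n", off, plan[off]) // A, B, C, A
	}
}
```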
* fix(shell): improve fs.distributeChunks robustness and code quality
- Propagate flag parse errors instead of swallowing them (return err)
- Handle nil chunk.Fid by falling back to legacy FileId string parsing
- Simplify node membership check using slices.Contains
* fix(shell): fix dead round-robin print loop in fs.distributeChunks
The loop was computing targetNode with sc.index%totalNodes (original
chunk index) instead of the sequential position, and discarding it via
_ = targetNode without printing anything. Replace with a correct loop
using pos%totalNodes and actually print the first 12 node assignments.
* fix(shell): compute replication/collection per-chunk in fs.distributeChunks
Previously replication and collection were derived once from chunks[0]
and reused for all moves, causing wrong volume placement for chunks
belonging to different volumes or collections. Now each chunk looks up
its own volumeInfoMap entry immediately before calling operation.Assign.
* fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks
When the master returns an Auth token in the Assign response, use it
directly for the upload instead of generating a new JWT from the local
viper signing key. Fall back to local key generation only when Auth is
empty, matching the pattern used by other upload paths.
* fix(shell): add timeout and error handling to delete requests in fs.distributeChunks
The delete loop was ignoring http.NewRequest errors and had no timeout,
risking a nil-request panic or indefinite block. Replace with
http.NewRequestWithContext and a 30s timeout, handle request creation
errors by incrementing deleteFailCount, and cancel the context
immediately after Do returns.
* feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup
Sequential chunk moves are a bottleneck for large LLM model files with
hundreds or thousands of chunks. Use ErrorWaitGroup with
DefaultMaxParallelization (10) to run download/assign/upload concurrently.
Guard movedRecords appends, chunk.Fid updates, and writer output with a
mutex. Individual chunk failures are non-fatal and logged inline; only
successfully moved chunks are included in the metadata update.
* fix(shell): try all replica URLs on download in fs.distributeChunks
Previously only the first volume server URL was attempted, causing chunk
moves to fail if that replica was unreachable. Now iterates through all
URLs returned by LookupVolumeServerUrl and stops at the first success.
* refactor(shell): apply extract method pattern to fs.distributeChunks
Do() was a single ~615-line function. Break it into focused helpers:
- lookupFileEntry: filer entry lookup
- validateChunks: chunk manifest guard
- collectVolumeTopology: master topology query + ownership mapping
- buildDistributionCounts: chunk→node mapping and owner/copy tallies
- selectActiveNodes: target node selection
- printCurrentDistribution: per-node distribution table
- planDistribution: mode-switch planning (primary/replica/round-robin)
- printRedistributionPlan: before/after plan table
- relevantNodes: active-or-occupied node filter
Do() is now ~100 lines of orchestration; each helper has a single
clear responsibility.
* test(shell): add unit tests for fs.distributeChunks algorithms
Cover all three distribution modes and supporting helpers:
- shortName, relevantNodes
- computeOwnerTarget (even/uneven split, inactive node drain)
- buildDistributionCounts (normal + nil Fid fallback)
- selectActiveNodes (all nodes / limited count)
- planOwnerMoves (imbalanced → balanced, already balanced)
- planDistribution primary (chunks balanced, no-op when even)
- planDistribution round-robin (offset ordering, correct assignment)
- planDistribution replica (owner + copy balancing)
- printRedistributionPlan (output format)
* fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks
Download requests had no per-request timeout, unlike delete operations
which already use 30s. Replace readUrl() calls with inline
http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume
server cannot block a goroutine indefinitely during redistribution.
* fix(shell): remove redundant deleteOldChunks in fs.distributeChunks
filer.UpdateEntry already calls deleteChunksIfNotNew internally, which
computes the diff between old and new entry chunks and deletes the ones
no longer referenced. Our explicit deleteOldChunks was racing with this
filer-side cleanup, causing spurious 404 warnings on ~75% of deletes.
Remove deleteOldChunks, movedChunkRecord type, and reduce
executeChunkMoves return type to (int, error) for the moved count.
* fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks
chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with
a FileId string instead of a Fid struct, causing them to be skipped in
the replica balancing loop and looked up incorrectly in volumeInfoMap.
Introduce chunkVolumeId() that uses Fid when present and falls back to
parsing the legacy FileId string, matching the logic in
buildDistributionCounts. Apply it in the replica-mode copies loop and
in executeChunkMoves' replication/collection lookup.
* fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks
chunkVolumeId(chunk) was being called to look up replication/collection
after oldFid had already been parsed and validated. Use oldFid.VolumeId
directly to avoid redundant parsing and guarantee the correct volume ID
regardless of whether chunk.Fid is nil.
* fix(shell): improve correctness and robustness in fs.distributeChunks
- Buffer download body before upload so dlCtx timeout only covers the
GET request; upload runs with context.Background() via bytes.NewReader
- Replace 'before, after := strings.Cut(...)' + '_ = before' with '_'
as the first return value directly
- Clone copiesCount before replica planner mutates it, keeping the
caller's map immutable
- Add nil-entry guard after filer LookupEntry to prevent panic on
unexpected nil response
* feat(shell): support chunk manifests in fs.distributeChunks
Large files stored as chunk manifests were previously rejected. Resolve
manifests up front via filer.ResolveChunkManifest, redistribute the
underlying data chunks, then re-pack through filer.MaybeManifestize
before UpdateEntry. The filer's MinusChunks resolves manifests on both
sides of the diff, so old manifest and inner data chunks are GC'd
automatically.
* fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature
Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>

6bcacedda9  Export master_disconnections metrics on volume servers. (#9104)

This allows tracking connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
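A hedged sketch of exporting such a counter with Prometheus client_golang; the metric name, namespace, and labels here are assumptions for illustration, not necessarily what the commit adds.

```go
// Sketch: register a counter vec and increment it whenever the
// master connection drops; Prometheus scrapes it from /metrics.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var masterDisconnections = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "SeaweedFS",
		Subsystem: "volumeServer",
		Name:      "master_disconnections_total", // assumed name
		Help:      "Master connection losses seen by this volume server.",
	},
	[]string{"master"},
)

func main() {
	prometheus.MustRegister(masterDisconnections)
	// called from the heartbeat loop whenever the master stream drops:
	masterDisconnections.WithLabelValues("master-1:9333").Inc()
	fmt.Println("counter registered; rate() over it surfaces failovers in real time")
}
```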

0315da9022  fix(s3api): self-heal stale .versions latest-version pointer on read (#9125)

* fix(s3api): self-heal stale .versions latest-version pointer on read

When the `.versions` directory metadata points at a version file that has gone missing (e.g. a crash between deleting the latest version and rewriting the pointer, or a concurrent delete racing with a read), `getLatestObjectVersion` bailed with a hard error that required manual repair. Tagging, ACL, retention, copy-source, and HEAD/GET all surfaced NoSuchKey even though other versions remained on disk.

On `ErrNotFound`/`codes.NotFound` from the pointed-at version lookup, rescan the `.versions` directory, pick the newest remaining non-delete-marker entry, persist the repaired pointer (best-effort), and return that entry. If only delete markers (or nothing) remain, the caller still sees an error and the object correctly appears absent.

Extracted the selection logic into a pure `selectLatestContentVersion` helper so `updateLatestVersionAfterDeletion` and the new self-heal path share a single implementation. A warning is logged whenever the heal kicks in so stale-pointer incidents remain visible in operator logs.

* fix(s3api): self-heal must promote newest version even if delete marker

Review feedback (gemini-code-assist): the self-heal path used `selectLatestContentVersion`, which skips delete markers. That had two bugs:

1. If the chronologically newest entry was a delete marker, an older content version would be promoted, effectively "undeleting" an object that was actually deleted.
2. If only delete markers remained, heal returned an error and the caller surfaced a hard 500 instead of the correct 404-with-x-amz-delete-marker response.

Add `selectLatestVersion` that picks the newest entry regardless of type (content or delete marker) and use it in `healStaleLatestVersionPointer`. The promoted entry flows back through `doGetLatestObjectVersion` unchanged; downstream handlers already detect `ExtDeleteMarkerKey` on the returned entry and render NoSuchKey + `x-amz-delete-marker: true` (see `s3api_object_handlers.go:722-728`).

Kept `selectLatestContentVersion` in place for `updateLatestVersionAfterDeletion`, which deliberately limits the pointer to a live content version in the post-deletion flow — changing that is out of scope for this fix.

Added four tests for the new selector:
  - promotes newest delete marker over older content (the reviewer case)
  - picks content when content is newest
  - promotes newest delete marker when only delete markers remain
  - returns nil latestEntry on empty/untagged input

* fix(s3api): paginate .versions scan in self-heal path

Review feedback (gemini-code-assist, coderabbitai): the self-heal rescan did a single-shot list(..., 1000). For objects whose version ids use the old raw-timestamp format, filer ordering is lexicographic-ascending = oldest-first. If the .versions directory held more than one page of entries, the first page contained only the oldest, and the heal would promote an older version as the new "latest" — silently surfacing stale data on subsequent reads.

Paginate through the whole directory with `filer.PaginationSize`, running `selectLatestVersion` per page and keeping a single running best candidate across pages (via `compareVersionIds`). This mirrors the pagination pattern already used by `getObjectVersionList` in the same file and closes the window for old-format stacks larger than one page. New-format (inverted-timestamp) ids were not affected because their lexicographic order matches newest-first, but paginating is still the right fix. Also updated the function doc to reflect that self-heal now promotes the newest entry regardless of type (content version or delete marker).

* fix(s3api): don't resurrect deleted objects; wrap ErrNotFound sentinel

Two review findings addressed:

1. `updateLatestVersionAfterDeletion` was using `selectLatestContentVersion` which skipped delete markers. Scenario: PUT v1, DELETE (dm1 written, pointer->dm1), PUT v3 (pointer->v3), user explicitly deletes v3 by versionId. Remaining files: v1, dm1. S3 semantics say the current version is the chronologically newest = dm1 (object appears deleted). Old code would promote v1, "undeleting" the object silently. Switched to `selectLatestVersion`, which picks the newest entry regardless of type. Also paginated the scan the same way the self-heal path does — otherwise a single-page `list(1000)` still mis-selects the latest for old-format version id stacks that exceed one page. Removed the `hasDeleteMarkers || !isLast` branch: with `selectLatestVersion` any delete marker participates in the comparison and shows up as the latest when it is newest, so the "keep the directory on ambiguity" guard becomes unreachable. Full pagination also makes the `!isLast` guard unnecessary — we either see every entry or surface a list error.

2. `healStaleLatestVersionPointer`'s "no remaining version" error was a plain `fmt.Errorf`, so callers could not distinguish genuine object-absence from a scan failure via `errors.Is`. Wrap it with `filer_pb.ErrNotFound` so the sentinel flows through the chain (the outer wrap already uses `%w`).

No test additions needed — `TestSelectLatestVersion_PromotesNewestDeleteMarker` already asserts the "newer delete marker beats older content" invariant that drives both fixes.

* refactor(s3api): drop unused selectLatestContentVersion after review

Review feedback flagged a stale comment claiming selectLatestContentVersion "mirrors" the post-deletion semantics of updateLatestVersionAfterDeletion. That claim became false when the post-deletion path switched to selectLatestVersion in the previous commit. Verified no production callers remain — only the helper's own tests referenced it — so the cleaner fix is to delete the dead code rather than rewrite the comment to explain "this exists but isn't used."
  - Removed selectLatestContentVersion.
  - Removed the four TestSelectLatestContentVersion_* tests.
  - Preserved a renamed TestSelectLatestVersion_MixedFormats so the mixed-format comparator coverage still runs against the active selector.
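The paginated running-best scan, reduced to a sketch; listPage and the newer comparator are hypothetical stand-ins for the filer listing and compareVersionIds, and the page size is illustrative.

```go
// Sketch: select the newest entry per page and keep one running best
// candidate across pages, so an oldest-first (old version-id format)
// directory larger than one page cannot mis-elect the latest version.
package main

import "fmt"

func latestVersion(listPage func(startFrom string, limit int) []string, newer func(a, b string) bool) string {
	var best string
	start := ""
	for {
		page := listPage(start, 1000)
		if len(page) == 0 {
			return best // every entry was seen (or the directory is empty)
		}
		for _, v := range page {
			if best == "" || newer(v, best) {
				best = v // running best across all pages
			}
		}
		start = page[len(page)-1] // resume after the last entry of this page
	}
}

func main() {
	pages := [][]string{{"v001", "v002"}, {"v003"}, {}}
	i := 0
	fmt.Println(latestVersion(
		func(string, int) []string { p := pages[i]; i++; return p },
		func(a, b string) bool { return a > b }, // stand-in comparator
	)) // v003
}
```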

bfea14320a  fix(s3): reject unknown POST policy conditions and extra x-amz form fields (#9124)

* fix(s3): reject unknown POST policy conditions and extra x-amz form fields

CheckPostPolicy previously accepted policy conditions with unknown $keys (e.g. "$foo") as satisfied, and only rejected stray X-Amz-Meta-* form fields. Reject unknown condition keys outright, and extend the extra-input-fields check to all X-Amz-* form fields except the reserved auth/signing headers. Matches AWS S3 POST Object behavior.

* refactor(s3): drop redundant $x-amz-meta- prefix check in CheckPostPolicy

The $x-amz- prefix already subsumes $x-amz-meta-, so the explicit $x-amz-meta- check adds no coverage. Simplify the else-if condition. Addresses gemini-code-assist review on PR #9124.

* style(s3): align unknown-key policy error with [op, key, value] trailer

Reformat the unknown-condition-key error in CheckPostPolicy to include the same "[op, key, value]" trailer used by the other condition-failed messages. The value slot is empty because no comparison occurs for an unknown key. The descriptive "unknown condition key" suffix is kept so operators can still tell this failure from a mismatched value.

* fix(s3): honor starts-with prefix-stem POST policies when checking extras

AWS POST policies use ["starts-with","$x-amz-meta-",""] to allow any X-Amz-Meta-* form field. The previous exact-match policyXAmzKeys would flag every X-Amz-Meta-Foo as an "Extra input fields" failure because only the stem X-Amz-Meta- was stored. Track starts-with conditions whose key ends in "-" with an empty value as prefix stems, and accept any X-Amz-* form field matching one of those stems.

* fix(s3): validate value prefix for starts-with POST policy stems

Drop the policy.Value == "" gate when detecting prefix-stem conditions so that ["starts-with","$x-amz-meta-","pfx-"] is recognized as a prefix rule. Track the required value prefix alongside the name prefix, enforce it against every matching form field in the extras loop, and skip the prefix-stem condition in the main iteration (it has no single form field to evaluate). Also include policy.Value in the unknown-condition error trailer for clearer debugging. Addresses gemini-code-assist review on PR #9124.

* fix(s3): check every matching POST policy rule, not just the first

The extras loop exited early on exact-key match and broke on the first matching prefix stem. Per AWS, a form field must satisfy every policy condition that applies to it, so an exact-match field must still honor any overlapping starts-with stem's value prefix, and multiple stems on the same field must all hold. Drop both early exits: start matched from the exact-key lookup, iterate all prefix stems, and fail on the first value-prefix violation. Addresses gemini-code-assist review on PR #9124.
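A sketch of the resulting extras check: a form field passes only if it matches the exact-key set or at least one stem, and every matching stem's value prefix must also hold. Types are simplified stand-ins, not the s3api implementation.

```go
// Sketch: exact keys and starts-with stems both mark a field as
// matched; any matching stem whose value prefix fails rejects it.
package main

import (
	"fmt"
	"strings"
)

type stem struct{ keyPrefix, valuePrefix string }

func extraFieldOK(name, value string, exactKeys map[string]bool, stems []stem) bool {
	matched := exactKeys[name] // no early exit: stems are still checked below
	for _, s := range stems {
		if strings.HasPrefix(name, s.keyPrefix) {
			matched = true
			if !strings.HasPrefix(value, s.valuePrefix) {
				return false // a matching stem's value prefix must hold too
			}
		}
	}
	return matched // unmatched X-Amz-* fields are "Extra input fields"
}

func main() {
	exact := map[string]bool{"X-Amz-Storage-Class": true}
	stems := []stem{{"X-Amz-Meta-", "pfx-"}} // from ["starts-with","$x-amz-meta-","pfx-"]
	fmt.Println(extraFieldOK("X-Amz-Meta-Foo", "pfx-1", exact, stems)) // true
	fmt.Println(extraFieldOK("X-Amz-Meta-Foo", "other", exact, stems)) // false: value prefix violated
	fmt.Println(extraFieldOK("X-Amz-Unknown", "v", exact, stems))      // false: no rule matches
}
```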

0bddc2652e  fix(s3): propagate validated POST form fields to upload headers (#9123)

* fix(s3): propagate validated POST form fields to upload headers

POST Object form fields like acl, Content-Encoding, x-amz-storage-class, x-amz-tagging, and x-amz-server-side-encryption were validated against the POST policy but never forwarded to the underlying PUT, so the validated values had no effect. Forward all non-reserved x-amz-* fields plus acl (as x-amz-acl) and Content-Encoding. Reserved POST policy mechanism fields (Policy, Signature, Key, etc.) are still excluded.

* fix(s3): also forward Content-Language from POST form to upload headers

AWS S3 POST Object supports Content-Language; add it to the set of content headers forwarded by applyPostPolicyFormHeaders alongside Cache-Control, Expires, Content-Disposition, and Content-Encoding. Addresses gemini-code-assist review on PR #9123.

* refactor(s3): only look up form value in branches that use it

applyPostPolicyFormHeaders previously called formValues.Get(k) for every form field, including fields that fall through the switch (non-reserved fields that are neither Acl, a forwarded content header, nor X-Amz-*). Move the lookup inside the switch cases that actually use it.

2da24cc230  fix(s3): return 403 on POST policy violation instead of 307 redirect (#9122)

* fix(s3): return 403 on POST policy violation instead of 307 redirect

CheckPostPolicy failures previously responded with HTTP 307 Temporary Redirect to the request URL, which causes clients to re-POST and obscures the failure. Return 403 AccessDenied so the client surfaces the error.

* test(s3): exercise PostPolicyBucketHandler end-to-end for 403 mapping

Replace the shallow ErrAccessDenied tautology test with one that builds a signed POST multipart request whose policy conditions cannot be satisfied, calls PostPolicyBucketHandler directly, and asserts HTTP 403 with no Location redirect header. Addresses gemini-code-assist review on PR #9122.

* fix(s3): surface POST policy failure reason in AccessDenied response

Add s3err.WriteErrorResponseWithMessage so a caller can keep the standard error code mapping while providing a specific Message. Use it from PostPolicyBucketHandler so the XML body carries the CheckPostPolicy error (e.g. which condition failed or that the policy expired) rather than the generic "Access Denied." description. Addresses gemini-code-assist review on PR #9122.

* refactor(s3err): delegate WriteErrorResponse to WriteErrorResponseWithMessage

The two helpers shared every line except the Message override. Fold WriteErrorResponse into a one-line delegation that passes an empty message, so the request-id/mux/apiError logic lives in exactly one place. Addresses gemini-code-assist review on PR #9122.

32cbed9658  fix(fuse-tests): avoid ephemeral port reuse racing weed mini bind

freePort allocated in [20000, 55535], which overlaps the Linux ephemeral range (32768-60999). The kernel could reuse the chosen port for an outbound connection between the test closing its listener and weed mini re-checking availability, causing:

  Port allocation failed: port N for Filer (specified by flag filer.port) is not available on 0.0.0.0 and cannot be used

Narrow the range to [20000, 32000] to stay below the ephemeral floor, and pass -ip.bind=127.0.0.1 so mini's pre-check runs on the same IP the test actually reserved the port on.
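A sketch of the narrowed allocator under these constraints; the actual freePort implementation may differ in structure.

```go
// Sketch: pick below the ephemeral floor and probe on the same
// loopback IP the services will later bind with -ip.bind=127.0.0.1.
package main

import (
	"fmt"
	"math/rand"
	"net"
)

func freePort() int {
	for {
		p := 20000 + rand.Intn(12000) // [20000, 32000): below Linux's 32768 ephemeral floor
		l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", p))
		if err != nil {
			continue // port busy: try another
		}
		l.Close()
		return p // reserved and released on 127.0.0.1, matching the pre-check IP
	}
}

func main() { fmt.Println(freePort()) }
```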

cce98fcecf  fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly (#9120)

* fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly

AuthSignatureOnly is the only auth gate in front of S3Tables routes (incl. CreateTableBucket) and UnifiedPostHandler, but unlike authenticateRequestInternal it did not clear the internal IAM trust headers before running signature verification. S3Tables authorizeIAMAction reads X-SeaweedFS-Principal directly from the request and prefers it over the authenticated identity's PrincipalArn, so a signed low-privilege caller could append that header after signing (unsigned header, SigV4 still verifies) and have IAM policy evaluated against a spoofed principal, bypassing authorization.

Clear both X-SeaweedFS-Principal and X-SeaweedFS-Session-Token at the top of AuthSignatureOnly, mirroring the existing guard in authenticateRequestInternal. Add a regression test covering the header-injection path.

* refactor(s3): route AuthSignatureOnly through authenticateRequestInternal

Addresses review feedback: both entry points were independently maintaining the internal-IAM-header stripping and the auth-type dispatch switch. Collapse AuthSignatureOnly into a thin wrapper around authenticateRequestInternal so the security-critical header scrub and the signature-verify switch live in one place. Post-auth behavior unique to AuthSignatureOnly (AmzAccountId header) stays inline. No functional change beyond two harmless telemetry tweaks that now match authenticateRequestInternal: the per-branch glog verbosity shifts from V(3) to V(4), and the anonymous-found path now sets AmzAuthType.

* refactor(s3): centralize X-SeaweedFS-Principal/Session-Token header names

Introduce SeaweedFSPrincipalHeader and SeaweedFSSessionTokenHeader in weed/s3api/s3_constants so the trust-header literals are defined once and referenced consistently by the auth scrub, JWT auth path, bucket policy principal resolution, IAM authorization, and S3Tables IAM evaluation. Replace every remaining usage in weed/s3api and weed/s3api/s3tables. This removes the drift risk the reviewer called out: adding another call site with a typo can no longer silently bypass the scrub. Pure rename, no behavior change. No-op integration-test helper in test/s3/iam/s3_iam_framework.go left untouched (separate module, and the server now strips the client-supplied value regardless).
||
|
|
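The scrub itself is a two-line guard. A sketch, with the constants redeclared here for self-containment (the real ones live in weed/s3api/s3_constants):

```go
package s3api

import "net/http"

const (
	SeaweedFSPrincipalHeader    = "X-SeaweedFS-Principal"
	SeaweedFSSessionTokenHeader = "X-SeaweedFS-Session-Token"
)

// scrubInternalIAMHeaders drops client-supplied copies of the internal trust
// headers before signature verification, so a signed low-privilege caller
// cannot append them after signing (unsigned headers still pass SigV4) and
// have IAM evaluate a spoofed principal.
func scrubInternalIAMHeaders(r *http.Request) {
	r.Header.Del(SeaweedFSPrincipalHeader)
	r.Header.Del(SeaweedFSSessionTokenHeader)
}
```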
88ac2d0431 |
security(s3api): reject unsigned x-amz-* headers in SigV4 requests (#9121)
A presigned URL holder could attach arbitrary x-amz-* headers to a PUT request (e.g. x-amz-tagging, x-amz-acl, x-amz-storage-class, x-amz-server-side-encryption*, x-amz-object-lock-*, x-amz-meta-*, x-amz-website-redirect-location, x-amz-grant-*). Because only the headers declared in SignedHeaders participate in signature verification, the added headers bypass authentication; the PUT handler then persists them into the object's Extended metadata. Match the AWS SigV4 rule: every x-amz-* header present in the request must appear in SignedHeaders. Exempt x-amz-content-sha256 (already tamper-protected via the canonical request's payload-hash line) and, for presigned URLs, the SigV4 protocol parameters that live in the query string (X-Amz-Algorithm/Credential/Date/Expires/SignedHeaders/ Signature) in case they are duplicated as headers. Applies to both header-based and presigned SigV4; non-amz headers are unaffected. |
||
|
|
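A sketch of the rule as described, with illustrative names; the real verifier integrates this into canonical-request construction:

```go
package s3api

import (
	"net/http"
	"strings"
)

// SigV4 protocol parameters that may legitimately be duplicated as headers on
// presigned requests.
var presignedExempt = map[string]bool{
	"x-amz-algorithm": true, "x-amz-credential": true, "x-amz-date": true,
	"x-amz-expires": true, "x-amz-signedheaders": true, "x-amz-signature": true,
}

// hasUnsignedAmzHeaders reports whether any x-amz-* header present on the
// request is missing from SignedHeaders (lowercased names), which would let it
// bypass signature verification. Non-amz headers are unaffected.
func hasUnsignedAmzHeaders(r *http.Request, signed map[string]bool, isPresigned bool) bool {
	for name := range r.Header {
		lower := strings.ToLower(name)
		if !strings.HasPrefix(lower, "x-amz-") {
			continue
		}
		// x-amz-content-sha256 is already tamper-protected via the canonical
		// request's payload-hash line.
		if lower == "x-amz-content-sha256" || (isPresigned && presignedExempt[lower]) {
			continue
		}
		if !signed[lower] {
			return true // reject the request
		}
	}
	return false
}
```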
e8767f42b6 | Add security policy for vulnerability reporting | ||
|
|
f720f559cb |
ci(kafka-loadtest): switch off Ubuntu/Debian base images to avoid apt mirror flakes (#9119)
* ci(kafka-loadtest): retry apt-get to survive Ubuntu mirror flakes
The Kafka Quick Test workflow's Docker build of Dockerfile.loadtest
keeps hitting "Connection failed [IP: ...]" on archive.ubuntu.com /
security.ubuntu.com mid-build, e.g.:
failed to solve: process "/bin/sh -c apt-get update && \
apt-get install -y ca-certificates curl jq bash netcat ..."
did not complete successfully: exit code: 100
Same class of failure PR #9106 fixed for pjdfstest. Apply the same
two retry knobs to Dockerfile.loadtest and Dockerfile.seektest so a
transient mirror flake retries five times with a 30s timeout instead
of failing the whole workflow.
(The fuller pjdfstest fix also restructured the build to use
docker/build-push-action with type=gha cache. Doing that for
kafka-client-loadtest would mean rewriting the make/docker-compose
build path; defer that until apt-retry alone proves insufficient.)
* ci(kafka-loadtest): also drop recommends/suggests + apt-get clean
Address PR review (gemini-code-assist): fully align with PR #9106's
pattern by adding --no-install-recommends / --no-install-suggests so
the runtime images stay small and don't pull in extra packages, plus
apt-get clean before rm -rf /var/lib/apt/lists/* in Dockerfile.seektest.
* ci(kafka-loadtest): use alpine / maven base images instead of apt
The previous rounds of apt-retry / apt-clean knobs aren't enough:
the Ubuntu mirror is persistently unreachable from the GitHub runner
for minutes at a time, which blows past the 5-retry / 30-second
Acquire configuration and still kills the build (see run 24551809614).
Switch both runtime images so no apt fetch is needed at all:
- Dockerfile.loadtest now runs on alpine:3.20. All runtime deps
(ca-certificates, curl, jq, bash, netcat-openbsd) are in the
Alpine main repo, fetched from Alpine's CDN rather than the
Ubuntu archive that keeps going dark.
- Dockerfile.seektest now uses maven:3.9-eclipse-temurin-11, which
ships JDK 11 and Maven preinstalled — no apt-get maven step.
This also means the runtime images no longer care about
Acquire::Retries / DEBIAN_FRONTEND / apt-get clean, so those lines
are removed with the apt call they were configuring.
|
||
|
|
45578a42e9 |
fix(volume): keep vacuum running past dangling .idx entries (#9115)
* fix(volume): keep vacuum running past dangling .idx entries
Vacuum compaction aborted entirely on the first .idx entry whose offset pointed past the end of the .dat file, surfacing as `cannot hydrate needle from file: EOF` and stalling progress on every other volume. In both Go and Rust:
- During compaction, skip an unreadable needle and continue. The bytes it pointed at were already unreachable via reads, so dropping the index reference makes the post-vacuum volume consistent. Real EIO still bails out so a disk fault is not silently papered over.
- At volume load, do a single linear scan of the .idx and confirm every (offset + actual size) fits inside .dat. The pre-existing integrity check only looked at the last 10 entries, so deeper corruption (e.g. left over from a crashed batched write) went undetected and only surfaced later as a vacuum EOF. A failure now marks the volume read-only at load time so an operator can react.
Refs #8928
* fix(volume): only skip permanent-corruption needle reads during vacuum
Address PR review feedback (gemini-code-assist + coderabbit): The original patch skipped any non-EIO read failure, which would silently drop needles on transient errors — Windows hardware bad-sector errors (ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network timeouts and EROFS would also slip through and shrink the volume. Switch to an explicit whitelist of permanent-corruption shapes:
- Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of range" errors with %w so callers can match via errors.Is.
- copyDataBasedOnIndexFile now skips only when the read failure is io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid, or ErrorCorrupted. Anything else (real disk faults, environmental errors, Windows hardware codes) aborts the compaction so an operator notices.
- Mirror the same whitelist in the Rust volume server, matching on io::ErrorKind::UnexpectedEof and the NeedleError corruption variants (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort).
Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows t.TempDir() cleanup can release the .dat/.idx handles.
Refs #8928
* fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch
Address PR review: the fallback branch in ReadBytes returned an unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap with %w so the whitelist applies, while leaving the existing direct sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry still triggers.
Refs #8928
* fix(volume): integrate dangling-idx check into existing index load walk
Address PR review (gemini-code-assist, medium): the structural .idx check used to do a second linear scan of the index file at every volume load, doubling the disk-I/O cost on servers managing many volumes. Track the largest (offset + actual size) seen during the existing needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`, `NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`, `DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`, exposed as `MaxNeedleEnd()` on the NeedleMapper interface. `volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size after the load is complete — pure numeric comparison, no extra I/O.
The standalone `verifyIndexFitsInDat` helper and its caller in `CheckVolumeDataIntegrity` are removed; the test that used to drive the helper directly now exercises the new path via `LoadCompactNeedleMap`. Mirror the same change in the Rust volume server: track `max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()` on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum. The Rust load walk already happens in `load_from_idx` for both map kinds, so the structural check becomes free.
Refs #8928
|
||
|
|
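The whitelist logic is compact. A Go sketch with stand-in sentinels (the real ones live in the needle package, wrapped via %w so errors.Is can match):

```go
package storage

import (
	"errors"
	"fmt"
	"io"
)

// Stand-ins for the needle package sentinels referenced above.
var (
	ErrorCorrupted    = errors.New("needle data corrupted")
	ErrorSizeMismatch = errors.New("needle size mismatch")
	ErrorSizeInvalid  = errors.New("needle size invalid")
)

// isSkippableNeedleReadError reports whether a failed needle read during
// vacuum compaction points at data that is already unreachable (safe to drop
// from the index) rather than at an environmental or hardware fault (abort).
func isSkippableNeedleReadError(err error) bool {
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, ErrorSizeMismatch) ||
		errors.Is(err, ErrorSizeInvalid) ||
		errors.Is(err, ErrorCorrupted)
}

// Wrapping with %w is what makes the whitelist matchable:
func wrapCRCError(expected, actual uint32) error {
	return fmt.Errorf("CRC %08x != %08x: %w", expected, actual, ErrorCorrupted)
}
```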
664ae64646 |
ci(binaries_dev): serialize concurrent runs to prevent asset name collisions
Multiple master pushes within the same minute produced identical BUILD_TIME values, causing concurrent workflow runs to race on identically-named release assets. Upload retries hit 422 already_exists and failed the build. Adding a concurrency group with cancel-in-progress ensures only the latest dev build runs at a time, which is fine since only the latest dev artifacts matter. |
||
|
|
018e648d00 |
build(deps): bump github.com/jackc/pgx/v5 from 5.8.0 to 5.9.0 (#9113)
Bumps [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) from 5.8.0 to 5.9.0. - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](https://github.com/jackc/pgx/compare/v5.8.0...v5.9.0) --- updated-dependencies: - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
66ce0c29cf | fix(weed/query/engine): check for nil pointers (#9114) | ||
|
|
108e6b0d5f | Delete scheduled_tasks.lock | ||
|
|
9d15705c16 |
fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112)
* fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C
Interrupts fired grace hooks in registration order, so master (started first) shut down before its clients, producing heartbeat-canceled errors and masterClient reconnection noise during weed mini shutdown. Admin/s3/webdav had no interrupt hooks at all and were killed at os.Exit.
- grace: execute interrupt hooks in LIFO (defer-style) order so later-started services tear down first.
- filer: consolidate the three separate interrupt hooks (gRPC / HTTP / DB) into one that runs in order, so filer shutdown stays correct independent of FIFO/LIFO semantics.
- mini: add MiniClientsShutdownCtx (separate from test-facing MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3, WebDAV and the maintenance worker observe it; runMini registers a cancel hook after startup so under LIFO it fires first and waits up to 10s on a WaitGroup for those services to drain before filer, volume, and master shut down.
Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP -> DB) -> volume -> master.
* refactor(mini): group mini-client shutdown into one state struct
The first pass spread the shutdown plumbing across three globals (MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker). Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini invocation, and chain its ctx from MiniClusterCtx so clients only observe one signal. Tests that cancel MiniClusterCtx still trigger client shutdown via parent-child propagation.
- resetMiniClients() installs fresh state at the top of runMini, so in-process test reruns don't inherit stale ctx/wg.
- onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown and only observes one ctx.
- trackMiniClient() replaces the manual wg.Add/Done dance for the admin goroutine.
- miniClientsCtx() gives the admin startup a ctx without re-deriving.
- triggerMiniClientsShutdown(timeout) is the interrupt hook body.
No behaviour change; existing tests pass.
* refactor: generalize shutdown ctx as an option, not a mini-specific helper
Several service files (s3, webdav, filer, master, volume) observed the mini-specific MiniClusterCtx or called onMiniClientsShutdown directly. That leaked mini orchestration into code that also runs under weed s3, weed webdav, weed filer, weed master, and weed volume standalone. Replace with a generic `shutdownCtx context.Context` field on each service's Options struct. When non-nil, the server watches it and shuts down gracefully; when nil (standalone), the shutdown path is a no-op. Mini wires the contexts up from a single place (runMini):
- miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx = MiniClusterCtx (drives test-triggered teardown)
- miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives Ctrl+C teardown before filer/volume/master)
All knowledge of MiniClusterCtx now lives in mini.go.
* fix(mini): stop worker before clients ctx so admin shutdown isn't blocked
Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/webdav hook sat for 10s then logged "timed out". Admin had started its shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop, waiting for the still-connected worker stream. That in turn left filer clients connected and cascaded into filer's own 10s gRPC graceful-stop timeout. Two causes, both fixed:
1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which makes managerLoop `break out` and exit), then called getTaskLoad() which sends to the same unbuffered cmd channel — no receiver, hangs forever. Reorder Stop() to snapshot the admin client and drain tasks BEFORE sending ActionStop, and call Disconnect() via the local snapshot afterwards.
2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads from c.incoming, which Disconnect closes, yielding a nil response and a panic on response.Message. Handle the closed channel explicitly.
3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs synchronously BEFORE the clients ctx is cancelled. Register worker shutdown there so admin's worker-gRPC GracefulStop finds the worker already disconnected and returns immediately, instead of waiting on a stream that is about to close anyway.
Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full process exit in ~11s (the remaining 10s is a pre-existing filer gRPC graceful-stop timeout, not cascaded from the clients tier).
* feat(mini): cap filer gRPC graceful stop at 1s under weed mini
Full weed mini shutdown was ~11s on a clean exit, dominated by the filer's default 10s gRPC GracefulStop timeout while background SubscribeLocalMetadata streams drained. Expose the timeout as a FilerOptions.gracefulStopTimeout field (default 10s for standalone weed filer) and set it to 1s in mini. Clean weed mini shutdown now takes ~2s.
|
||
|
|
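A minimal sketch of LIFO (defer-style) hook execution, assuming a simplified grace package shape; the real one also handles signal plumbing:

```go
package grace

import "sync"

var (
	mu    sync.Mutex
	hooks []func()
)

// OnInterrupt registers a shutdown hook; later-started services register later.
func OnInterrupt(fn func()) {
	mu.Lock()
	defer mu.Unlock()
	hooks = append(hooks, fn)
}

// runHooks executes hooks in LIFO order, so services started last (admin, s3,
// webdav) tear down before the filer/volume/master tier they depend on.
func runHooks() {
	mu.Lock()
	defer mu.Unlock()
	for i := len(hooks) - 1; i >= 0; i-- {
		hooks[i]()
	}
}
```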
9554e259dd |
fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint (#9109)
* fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint
DuckDB ATTACH 's3://<bucket>/' AS cat (TYPE 'ICEBERG', ...) was failing with "schema does not exist" because GET /v1/config ignored the warehouse query parameter and returned no overrides.prefix, so subsequent requests fell through to the hard-coded "warehouse" default bucket instead of the one the client attached. LoadTable also returned an empty config, forcing clients to discover the S3 endpoint out-of-band and producing 403s on direct iceberg_scan calls.
- handleConfig now echoes overrides.prefix = bucket and defaults.warehouse when ?warehouse=s3://<bucket>/ is supplied.
- getBucketFromPrefix honors a warehouse query parameter as a fallback for clients that skip the /v1/config handshake.
- LoadTable responses advertise s3.endpoint and s3.path-style-access so clients can reach data files without separate configuration.
Refs #9103
* address review feedback on iceberg S3 endpoint vending
- deriveS3AdvertisedEndpoint is now a method on S3Options; honors externalUrl / S3_EXTERNAL_URL, switches to https when -s3.key.file is set, uses the https port when configured, and brackets IPv6 literals via util.JoinHostPort.
- handleCreateTable returns s.buildFileIOConfig() in both its staged and final LoadTableResult branches so create and load flows see the same FileIO hints.
- Add unit test coverage for the endpoint derivation scenarios.
* address CI and review feedback for #9109
- DuckDB integration test now runs under its own newOAuthTestEnv (with a valid IAM config) so the OAuth2 client_credentials flow DuckDB requires actually works; the shared env has no registered credentials, which was the cause of the CI failure. Helper createTableWithToken was added to create tables via Bearer auth.
- Tighten TestIssue9103_LoadTableDoesNotVendS3FileIOCredentials to also assert s3.path-style-access = "true", so a partial regression where the endpoint is vended but path-style is dropped still fails.
- deriveS3AdvertisedEndpoint now logs a startup hint when it infers the host from os.Hostname because the bind IP is a wildcard, pointing operators at -s3.externalUrl / S3_EXTERNAL_URL for reverse-proxy deployments where the inferred name is not externally reachable.
- handleConfig has a comment explaining that any sub-path in the warehouse URL is dropped because catalog routing is bucket-scoped.
* fix(iceberg): make advertised S3 endpoint strictly opt-in; add region
The wildcard-bind fallback to os.Hostname() in deriveS3AdvertisedEndpoint was hijacking correctly-configured clients: on the CI runner it produced http://runnervmrc6n4:<port>, which Spark (running in Docker) could not resolve, so Spark iceberg tests began failing after the endpoint started being vended in LoadTable responses. Change the rule so advertising is opt-in and never guesses a host that might not be routable:
- -s3.externalUrl / S3_EXTERNAL_URL wins (covers reverse-proxy).
- Otherwise, only an explicit, non-wildcard -s3.bindIp is used.
- Wildcard / empty bind returns "" so no FileIO endpoint is vended and existing clients keep using their own configuration.
buildFileIOConfig additionally vends s3.region (defaulting to the same value baked into table bucket ARNs) whenever it vends an endpoint, so DuckDB's attach does not fail with "No region was provided via the vended credentials" when the operator has opted in.
The DuckDB issue-9103 integration test runs under an env with a wildcard bind, so it explicitly sets AWS_REGION in the docker run to pick up the same default. The HTTP-level LoadTable-vending test was dropped because its expectation is now conditional and already covered by unit tests in iceberg_issue_9103_test.go.
|
||
|
|
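The final opt-in rule reduces to a short decision function. An illustrative sketch (not the actual S3Options method):

```go
package s3api

import (
	"net"
	"strconv"
)

// deriveS3AdvertisedEndpoint applies the strictly opt-in rule: an explicit
// external URL wins, then a concrete bind IP; a wildcard bind advertises
// nothing rather than guessing a hostname clients may not resolve.
func deriveS3AdvertisedEndpoint(externalURL, bindIP string, port int, httpsEnabled bool) string {
	if externalURL != "" {
		return externalURL // -s3.externalUrl / S3_EXTERNAL_URL (reverse proxy)
	}
	if bindIP == "" || bindIP == "0.0.0.0" || bindIP == "::" {
		return "" // vend no FileIO endpoint; clients keep their own config
	}
	scheme := "http"
	if httpsEnabled {
		scheme = "https"
	}
	// JoinHostPort brackets IPv6 literals.
	return scheme + "://" + net.JoinHostPort(bindIP, strconv.Itoa(port))
}
```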
00a2e22478 |
fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes
The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.
Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.
Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.
* fix(upload): populate ExpectedDataSize from actual chunk bytes
UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.
Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.
* fix(assign): pass real ExpectedDataSize at every assign call site
After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.
- filer: assignNewFileInfo now takes expectedDataSize and stamps it
on both primary and alternate VolumeAssignRequests. Callers pass:
- SSE data-to-chunk: len(data)
- copy manifest save: len(data)
- streamCopyChunk: srcChunk.Size
- TUS sub-chunk: bytes read
- saveAsChunk (autochunk/manifestize): 0 (small, size unknown
until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
upload_chunked.go passes the buffered dataSize at the call site.
S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
FileSize (average for batched requests, real per-chunk size for
sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).
* fix(assign): thread size through SaveDataAsChunkFunctionType
The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.
Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
UploadWithRetry, which already stamps len(data) on the assign
after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
assignNewFileInfo so manifest-chunk assigns get a real size.
Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
|
||
|
|
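The rule all these call sites converge on is small. A sketch with an illustrative request type (the real messages are the filer/master assign protos):

```go
package operation

// AssignRequest is a cut-down stand-in for the volume assign request.
type AssignRequest struct {
	Count            uint64
	ExpectedDataSize uint64 // master charges Count*ExpectedDataSize against effectiveSize
}

// stampExpectedSize fills ExpectedDataSize from the buffered chunk bytes when
// the caller has not already set it, so the master does not fall back to its
// 1 MB default estimate and over-reserve the volume.
func stampExpectedSize(req *AssignRequest, data []byte) {
	if req.ExpectedDataSize == 0 {
		req.ExpectedDataSize = uint64(len(data))
	}
}
```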
7916e61c08 |
fix(mount): avoid self-notify deadlock in Link and CopyFileRange handlers (#9110)
The Link and CopyFileRange FUSE request handlers were calling fuseServer.InodeNotify (and EntryNotify for copy) synchronously while the kernel was still waiting for the request's reply on the same /dev/fuse fd. Notifications share that fd, so the syscall.Write can block indefinitely when the kernel hasn't drained its queue yet, hanging the entire mount. A goroutine dump from a stuck mount showed the Link handler blocked in syscall.Write inside InodeNotify while the server's read loop kept waiting for new requests. Drop the synchronous notifies. The local meta cache is still updated inline, so subsequent filesystem ops see the fresh state; the kernel's attr/dentry caches re-fetch once their TTL expires. |
||
|
|
ecc0390795 |
fix(master): eagerly remove volume from writable when assign hits limit (#9108)
* fix(master): eagerly remove volume from writable when RecordAssign hits limit
Previously, a volume was only removed from the writable list by the
heartbeat-driven CollectDeadNodeAndFullVolumes pass, which runs every
pulse (5s) after a 5s heartbeat. Under sustained concurrent writes,
fio-style workloads observed in the field grew volumes 8-20x past the
configured 100MB limit (median 530MB, peak 1.98GB) during that
5-15s detection window.
RecordAssign already tracks effective size (reported + pending) on each
/dir/assign. It now also removes the volume from writable the moment
effectiveSize reaches volumeSizeLimit, and mirrors the activeVolumeCount
decrement that Topology.SetVolumeCapacityFull would have done on the
next heartbeat. The heartbeat path remains unchanged and idempotent
(vl.SetVolumeCapacityFull returns false if already removed, so no
double-decrement).
Recovery still works: if a heartbeat later reports size < limit and
the volume is not oversized, EnsureCorrectWritables adds it back.
- weed/topology/volume_layout.go: RecordAssign returns reachedCapacity
bool; adds AdjustActiveVolumeCountForFull helper.
- weed/topology/topology.go: PickForWrite invokes the decrement on
eager full transitions.
- TestPickForWrite: pass a 1024-byte hint instead of 0 so the default
1MB pendingDelta does not immediately bust the test's 32KB limit.
- New TestRecordAssignReachingCapacityRemovesFromWritable covers the
eager removal, active count accounting, and no-double-accounting.
* fix(master): recover eagerly-removed volume once decay clears pending
After RecordAssign eagerly removes a volume from writables because
effectiveSize reached the limit, decay can later bring effectiveSize
back under the limit (e.g., when a burst of assigns didn't all result
in uploads). Without recovery the volume would stay non-writable until
vacuum or a ReadOnly flip.
UpdateVolumeSize now re-adds the volume to writables once all of the
following hold:
* RecordAssign is what removed it (tracked via fullSince timestamp)
* at least capacityRecoveryDelay has elapsed since the removal (30s)
— this prevents bouncing during a steady stream of assigns near
the limit
* effectiveSize has decayed below the crowded threshold (90% of limit)
* reportedSize is under the limit (actual disk is not over)
* standard EnsureCorrectWritables preconditions: enough copies, all
copies writable, not oversized
The caller (SyncDataNodeRegistration) re-increments activeVolumeCount
symmetrically with the decrement done on eager removal.
* review: release VolumeLayout lock before UpAdjustDiskUsageDelta
adjustActiveVolumeCount held vl.accessLock across the tree-climbing
UpAdjustDiskUsageDelta walk. That walk takes per-level DiskUsages
locks and could be re-entered from other call paths that hold a
node-level lock and then acquire vl.accessLock. Copy the node list
under the VolumeLayout lock and release it before the tree walk to
eliminate the lock-ordering hazard.
|
||
|
|
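The eager transition reduces to an effective-size check per assign. An illustrative sketch:

```go
package topology

// volumeUsage is a stand-in for the per-volume accounting kept by the layout.
type volumeUsage struct {
	reportedSize uint64 // from the last heartbeat
	pendingBytes uint64 // charged by assigns, decayed as uploads land
}

// RecordAssign charges the assign against the volume and reports whether the
// effective size just reached the configured limit, so the caller can remove
// the volume from writables immediately instead of waiting for a heartbeat.
func (u *volumeUsage) RecordAssign(expectedDataSize, volumeSizeLimit uint64) (reachedCapacity bool) {
	u.pendingBytes += expectedDataSize
	return u.reportedSize+u.pendingBytes >= volumeSizeLimit
}
```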
40ffc73aa8 |
ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106)
* ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes
Replace the local buildx cache + manual fallback with docker/setup-buildx-action and docker/build-push-action using type=gha cache. The e2e and pjdfstest Dockerfile layers now persist across runs in GitHub's own cache backend, so apt-get update only hits Ubuntu mirrors when the Dockerfiles change. Also add Acquire::Retries and Timeout so first-run cache-miss builds survive transient mirror sync errors.
* ci(pjdfstest): use local registry to share e2e image across buildx builds
The docker-container buildx driver cannot see images loaded into the host Docker daemon, so the second build's FROM chrislusf/seaweedfs:e2e failed with "not found" on registry-1.docker.io. Run a local registry:2 on the runner, push both images to localhost:5000, remap the FROM via build-contexts so the Dockerfile stays unchanged, then tag the pulled images locally for docker compose to consume.
|
||
|
|
ff4f96c71f |
fix(filer): drop stale master gRPC cache on stream death (#9102) (#9107)
* fix(filer): drop stale master gRPC cache on stream death (#9102)
When the master server restarts behind a stable L4 endpoint (e.g. a Kubernetes ClusterIP Service), the filer's streaming KeepConnected channel detects the disconnect and reconnects, but the shared request-path ClientConn cached in pb.grpcClients can remain in READY state while actually being dead. New AssignVolume/LookupVolume calls reuse that cached channel and return `rpc error: code = Canceled desc = context canceled` for every request, until the filer pod is restarted.
- Expose pb.InvalidateGrpcConnection(address) to drop a cached ClientConn when a higher-level signal says it is stale.
- In MasterClient.tryConnectToMaster, invalidate the cached request-path channel whenever the KeepConnected stream returns, so unrelated callers dial fresh on their next RPC.
- Extend operation.Assign's retry predicate to cover Canceled and DeadlineExceeded while the caller context is still live: the first failure invalidates the stale ClientConn via shouldInvalidateConnection, and the retry dials a new channel.
* fix(grpc): invalidate cached peer conn on streaming death in other paths
Extends the master-client fix to the other streaming-caller + cached non-streaming-peer pairs that share the same stale-channel failure mode when the peer restarts behind a stable L4 endpoint (k8s Service VIP, external load balancer):
- pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote gateway, etc. → filer): invalidate the filer's cached ClientConn when the SubscribeMetadata stream returns an error.
- filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer): invalidate the peer's cached ClientConn after doSubscribeToOneFiler fails, so the next iteration's readFilerStoreSignature / updateOffset calls dial fresh.
- mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator (subscriber → broker): invalidate the broker's cached ClientConn when the SubscribeMessage / SubscriberToSubCoordinator stream errors.
- mq broker.BrokerConnectToBalancer (broker → broker-balancer): invalidate the balancer's cached ClientConn after the PublisherToPubBalancer stream errors.
* address review feedback on InvalidateGrpcConnection
- pb.InvalidateGrpcConnection: drop the cache entry under grpcClientsLock but call ClientConn.Close() after releasing the lock, so Close's internal synchronisation/IO doesn't serialise unrelated callers on the global map lock.
- wdclient.tryConnectToMaster: only invalidate the cached request-path channel when the streaming call returned an error. On a healthy leader redirect (gprcErr == nil) the cached channel is still usable and invalidating it just causes a needless re-dial from concurrent callers.
* refactor(grpc): centralize peer-conn invalidation in streaming path
Previously every streaming caller duplicated the same invalidate-cached-non-streaming-peer-conn wrapper around their WithGrpcClient(true, ...) call. Move that logic into WithGrpcClient itself: when the streaming fn returns an error, invalidate any cached ClientConn for the same address. This removes six near-identical call-site wrappers and gives every current and future streaming caller the fix by default. Also aligns the non-streaming branch with the new Invalidate helper's lock discipline: delete the cache entry under grpcClientsLock, then Close the ClientConn after releasing the lock.
|
||
|
|
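A sketch of the invalidation helper with the agreed lock discipline (delete under the map lock, Close after releasing it); names mirror the commit but the shape is simplified:

```go
package pb

import (
	"sync"

	"google.golang.org/grpc"
)

var (
	grpcClientsLock sync.Mutex
	grpcClients     = map[string]*grpc.ClientConn{}
)

// InvalidateGrpcConnection drops a cached ClientConn when a higher-level
// signal (e.g. a dead KeepConnected stream) says it is stale, so the next
// caller dials fresh instead of reusing a READY-but-dead channel.
func InvalidateGrpcConnection(address string) {
	grpcClientsLock.Lock()
	conn := grpcClients[address]
	delete(grpcClients, address)
	grpcClientsLock.Unlock()
	if conn != nil {
		// Outside the lock: Close's internal IO must not serialise callers.
		conn.Close()
	}
}
```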
9896eade51 |
feat(mount): set FOPEN_KEEP_CACHE on re-open of unchanged files (#9097)
* feat(mount): set FOPEN_KEEP_CACHE when file mtime is unchanged
On re-open of an unmodified file, signal the kernel to preserve its existing page cache. This eliminates redundant volume server reads for workloads that repeatedly open-read-close the same files (build systems, config readers, etc.).
* fix(mount): use guarded type assertion for openMtimeCache load
Use the two-value form of type assertion when loading from sync.Map to prevent potential panics if a non-int64 value is ever stored.
* fix(mount): skip redundant mtime store and invalidate on truncation
- Avoid redundant sync.Map Store when cached mtime already matches the current mtime, reducing contention on the hot open path.
- Invalidate openMtimeCache in SetAttr when file size changes (truncation), preventing stale kernel page cache after ftruncate.
* fix(mount): use nanosecond mtime precision and bounded cache for FOPEN_KEEP_CACHE
- Compare both Mtime (seconds) and MtimeNs (nanoseconds) to detect sub-second modifications common in automated workloads.
- Replace unbounded sync.Map with a bounded map + mutex (8192 entries, random eviction when full), following the existing atimeMap pattern.
- Extract applyKeepCacheFlag and invalidateOpenMtimeCache methods for clarity and testability.
- Add tests for nanosecond precision and cache eviction.
* fix(mount): invalidate mtime cache in truncateEntry for O_TRUNC consistency
Add invalidateOpenMtimeCache call to truncateEntry so the Create path with O_TRUNC follows the same explicit invalidation pattern as SetAttr and Write.
|
||
|
|
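A sketch of the keep-cache decision using go-fuse-style flags; the bounded, mutex-guarded cache and the seconds+nanoseconds comparison of the real code are elided here:

```go
package mount

import "github.com/hanwen/go-fuse/v2/fuse"

// openMtimeCache maps inode -> last observed nanosecond mtime. The real code
// bounds this map (8192 entries, random eviction) and guards it with a mutex.
var openMtimeCache = map[uint64]int64{}

// applyKeepCacheFlag tells the kernel to keep its page cache across open()
// when the file's mtime is unchanged since the last open; otherwise it
// records the new mtime so the next open can compare.
func applyKeepCacheFlag(out *fuse.OpenOut, inode uint64, mtimeNs int64) {
	if cached, ok := openMtimeCache[inode]; ok && cached == mtimeNs {
		out.OpenFlags |= fuse.FOPEN_KEEP_CACHE
		return
	}
	openMtimeCache[inode] = mtimeNs
}
```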
ceae01d05b |
test(kafka): retry ConsumeWithGroup on failed initial join (#9105)
The consumer group resumption flow in TestOffsetManagement occasionally fails with `read tcp ... i/o timeout` on the second consumer's first FetchMessage. Re-joining an existing group races with the previous member's LeaveGroup and session cleanup; the new reader can observe transient coordinator state that the kafka-go client surfaces as a connection read timeout. Wrap ConsumeWithGroup in a 3-attempt retry that rebuilds the reader only when no message was received yet, so a transient join failure doesn't fail the whole test. Partial progress (at least one message consumed) still returns immediately, preserving the original error semantics for post-join failures. |
||
|
|
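The retry shape, sketched with illustrative names: rebuild and retry only while no message has arrived; any partial progress returns immediately with the original error semantics.

```go
package kafka_test

// consumeWithRetry retries a transient initial-join failure up to `attempts`
// times, but never retries once the consumer has made partial progress.
func consumeWithRetry(attempts int, consume func() (gotAny bool, err error)) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		gotAny, err := consume()
		if err == nil {
			return nil
		}
		if gotAny {
			return err // partial progress: surface post-join failures as-is
		}
		lastErr = err // transient join race: rebuild the reader and retry
	}
	return lastErr
}
```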
216b52c13a |
perf(mount): add graduated write backpressure (#9099)
* perf(mount): add graduated write backpressure before buffer cap
Introduce soft (80%) and hard (95%) throttling thresholds in WriteBufferAccountant. When write buffer usage approaches the cap, Reserve() inserts brief sleeps to slow writers gradually rather than blocking them completely at the cap. This smooths out write latency under sustained load.
* fix(mount): address review feedback for graduated backpressure
- Use projected usage (used + n) for threshold checks so a single reservation crossing a threshold is throttled immediately.
- Widen no-throttle test timing tolerance to softThrottleDelay to avoid CI flakes from scheduling jitter.
- Replace fragile timing upper-bound in recovery test with counter-based invariant (hardThrottleCount stays at 1).
* docs(mount): clarify single-shot throttle design in WriteBufferAccountant
Add a comment explaining why graduated throttling runs once per Reserve call rather than inside the blocking loop: once at the cap, the evictor + cond.Wait mechanism frees actual capacity, which time-based sleeps cannot do.
|
||
|
|
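A sketch of the graduated Reserve path under the assumptions above (illustrative delays, cut-down struct); the throttle sleeps outside the lock and runs once per call, since only Release/eviction frees real capacity:

```go
package page_writer

import (
	"sync"
	"time"
)

const (
	softThrottleDelay = 2 * time.Millisecond
	hardThrottleDelay = 10 * time.Millisecond
)

type WriteBufferAccountant struct {
	mu   sync.Mutex
	cond *sync.Cond
	used int64
	cap  int64
}

func NewWriteBufferAccountant(capBytes int64) *WriteBufferAccountant {
	a := &WriteBufferAccountant{cap: capBytes}
	a.cond = sync.NewCond(&a.mu)
	return a
}

// Reserve throttles once based on projected usage (used+n), then blocks only
// when the cap is genuinely exhausted.
func (a *WriteBufferAccountant) Reserve(n int64) {
	a.mu.Lock()
	projected := a.used + n
	var delay time.Duration
	switch {
	case projected >= a.cap*95/100:
		delay = hardThrottleDelay
	case projected >= a.cap*80/100:
		delay = softThrottleDelay
	}
	if delay > 0 {
		a.mu.Unlock()
		time.Sleep(delay) // sleep outside the lock so Release can proceed
		a.mu.Lock()
	}
	for a.used+n > a.cap && a.used > 0 { // admit one oversized reservation
		a.cond.Wait()
	}
	a.used += n
	a.mu.Unlock()
}

func (a *WriteBufferAccountant) Release(n int64) {
	a.mu.Lock()
	a.used -= n
	a.cond.Broadcast()
	a.mu.Unlock()
}
```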
886d50a6a5 |
feat(mount): singleflight dedup for concurrent chunk reads (#9100)
* feat(mount): add singleflight deduplication for concurrent chunk reads
When multiple FUSE readers request the same uncached chunk concurrently, only one network fetch is performed. Other readers wait and share the downloaded data, reducing redundant volume server traffic under parallel read workloads.
* fix(util): make singleflight panic-safe with defer cleanup
If the provided function panics, the WaitGroup and map entry are now cleaned up via defer, preventing other waiters from hanging forever.
* fix(filer): remove singleflight from reader_cache to fix buffer ownership
The singleflight wrapper around chunk fetches returned the same []byte buffer to concurrent callers. Since each SingleChunkCacher owns and frees its data buffer in destroy(), sharing the same slice would cause a use-after-free or double-free with the mem allocator. The downloaders map already deduplicates in-flight downloads for the same fileId, so the singleflight was redundant at this layer. The SingleFlightGroup utility is retained for use elsewhere.
|
||
|
|
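A panic-safe single-flight sketch in the spirit of the second commit; the actual SingleFlightGroup may differ. The defer guarantees the WaitGroup is released and the map entry removed even if fn panics, so waiters never hang:

```go
package util

import "sync"

type call struct {
	wg  sync.WaitGroup
	val any
	err error
}

type SingleFlightGroup struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do executes fn once per in-flight key; concurrent callers wait and share
// the first caller's result.
func (g *SingleFlightGroup) Do(key string, fn func() (any, error)) (any, error) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // share the in-flight result
		return c.val, c.err
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	defer func() {
		// Runs on panic too: clear the entry, then unblock waiters.
		g.mu.Lock()
		delete(g.m, key)
		g.mu.Unlock()
		c.wg.Done()
	}()
	c.val, c.err = fn()
	return c.val, c.err
}
```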
e1fa4ec756 |
perf(cache): drop OS page cache after disk cache reads (#9098)
* perf(cache): drop OS page cache after disk cache reads
After reading from the on-disk chunk cache, advise the kernel via FADV_DONTNEED to release the corresponding page cache pages. This prevents double-caching the same data in both user-space and kernel page caches, freeing RAM for other uses on systems with large disk caches.
* fix(cache): guard dropReadCache against zero length and invalid fd
A zero-length fadvise is interpreted as "to end of file" on Linux, which would inadvertently drop the page cache for the entire remainder of the cache volume. Also check fd >= 0 to avoid unnecessary syscalls when the backend file is closed.
* perf(cache): only apply FADV_DONTNEED for reads >= 1 MiB
For small needle reads the syscall overhead outweighs the memory savings, and the kernel page cache is more beneficial for warm data. Restrict fadvise to reads of at least 1 MiB where the freed page cache is meaningful.
|
||
|
|
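A sketch of the guarded fadvise call using golang.org/x/sys/unix; the 1 MiB floor, the non-zero length, and the fd check mirror the commits:

```go
//go:build linux

package chunk_cache

import "golang.org/x/sys/unix"

const dropCacheMinSize = 1 << 20 // 1 MiB: below this the syscall isn't worth it

// dropReadCache advises the kernel that the pages just read from the on-disk
// chunk cache won't be re-read via the page cache, freeing RAM. A zero length
// would mean "to end of file" on Linux, so it is guarded along with the fd.
func dropReadCache(fd int, offset int64, length int64) {
	if fd < 0 || length < dropCacheMinSize {
		return
	}
	// Best-effort: an fadvise failure never fails the read itself.
	_ = unix.Fadvise(fd, offset, length, unix.FADV_DONTNEED)
}
```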
213b6c3107 |
fix(mount): count manifest sizes in merge condition to prevent accumulation (#9101)
* fix(mount): count manifest sizes in merge condition to prevent manifest accumulation
shouldMergeChunks only counted non-manifest chunk sizes toward the bloat threshold, so overlapping manifests accumulated undetected across flush cycles. During sustained random writes (e.g. fio), each metadata flush compacts non-manifest chunks and may group them into a new manifest via MaybeManifestize, while carrying forward all existing manifests. Since the merge condition only checked non-manifest totals against 2x file size, the growing pile of redundant manifests never triggered a merge. In a real workload: 4 GB file, 25 manifests each covering ~4 GB (107 GB manifest data on volume servers), but shouldMergeChunks saw only 4.2 GB of non-manifest chunks vs the 8.6 GB threshold -- no merge.
Fix: include manifest coverage sizes in totalChunkSize. This correctly detects the 111 GB total vs 8.6 GB threshold and triggers the merge, which re-reads the file as clean chunks and lets the filer's MinusChunks (which resolves manifests) garbage-collect all redundant sub-chunks.
* test(mount,filer): add manifest chunk correctness tests
Filer-level tests (filechunk_manifest_test.go):
- TestManifestRoundTripPreservesChunks: create -> serialize -> deserialize preserves fileId, offset, size, and timestamp for all sub-chunks
- TestCompactResolvedOverlappingManifests: two manifests covering the same range, older sub-chunks correctly identified as garbage after compaction
- TestDoMinusChunksWithResolvedManifests: old manifest sub-chunks detected as garbage when compared against new clean chunks (mirrors filer cleanup)
- TestManifestizeSmallBatchWithRemainder: batch=3 with 7 chunks produces 2 manifests + 1 remainder, all data recoverable after resolve
- TestCompactMultipleOverlappingManifestGenerations: 5 generations of full-file manifests, compaction keeps only the newest generation
- TestManifestBloatDetection: with N overlapping manifests, merge triggers at N >= 2 (stored > 2x file size)
Mount-level test (weedfs_file_sync_test.go):
- TestFlushCycleManifestAccumulation: simulates the flushMetadataToFiler pipeline over multiple cycles with a small manifest batch (5 chunks). Verifies merge triggers at cycle 2 when accumulated manifests exceed the 2x threshold. Uses SeparateManifestChunks + CompactFileChunks + shouldMergeChunks to mirror the real code path.
|
||
|
|
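The corrected condition is a one-line change in spirit. A sketch with stand-in types (the real code works on filer_pb.FileChunk):

```go
package mount

// chunk is a cut-down stand-in for filer_pb.FileChunk.
type chunk struct {
	size       uint64
	isManifest bool
}

// shouldMergeChunks triggers a full re-read/re-upload merge once stored bytes
// exceed 2x the logical file size. Manifest coverage now counts toward the
// total; before the fix only !c.isManifest sizes were summed, so redundant
// manifests accumulated invisibly across flush cycles.
func shouldMergeChunks(chunks []chunk, fileSize uint64) bool {
	var total uint64
	for _, c := range chunks {
		total += c.size
	}
	return total > 2*fileSize
}
```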
f9df187928 |
fix(mount): reduce chunk fragmentation from random writes (#9096)
* fix(mount): reduce chunk fragmentation from random writes
Random writes via FUSE mount create many small partially-overlapping chunks that bloat storage and slow reads. Two complementary fixes:
1. Fill gaps at seal time: when a partial writable chunk is sealed for upload, read existing file data into the unwritten regions so SaveContent uploads one complete chunk instead of many fragments. No overhead for sequential writes (IsComplete short-circuits).
2. Merge on fsync: after CompactFileChunks, if total non-manifest chunk data exceeds 2x the logical file size, re-read and re-upload the entire file as properly-sized chunks. The filer's cleanupChunks garbage-collects all superseded chunks.
* revert FillGaps: reading from volume servers during write path is too expensive
* test(mount): add tests for chunk merge condition
Extract shouldMergeChunks as a testable function and add tests covering:
- empty file, single chunk, non-overlapping chunks (no merge)
- exactly 2x boundary (no merge) vs just over 2x (merge)
- manifest chunks extend file size but don't count toward stored total
- input slices are not mutated (safe append)
- CompactFileChunks + condition: fully superseded, staggered overlaps, 75% overlap pattern, many 4K writes at 1K step
- randomized bloat detection (50 iterations)
- visible content preserved after compaction (100 iterations)
* fix(mount): cap merge buffer at file size for small files
Avoids allocating a full ChunkSizeLimit buffer when the file being merged is smaller. Addresses PR review feedback.
|
||
|
|
44ab2ffee3 |
fix(plugin): remove Min Volume Age field from vacuum plugin worker config (#9095)
* fix(plugin): remove Min Volume Age field from vacuum plugin worker config
The min_volume_age_seconds setting is not needed as a user-configurable field in the plugin worker form. The detection logic continues to use the hardcoded default from NewDefaultConfig().
* fix(plugin): disable volume age filtering in plugin worker detection
Set MinVolumeAgeSeconds to 0 in deriveVacuumConfig so the plugin worker does not filter out volumes by age during detection.
* fix(plugin): remove all volume age filtering from plugin worker detection
Remove age-related fields from detection trace messages, activity reports, and per-volume diagnostics. The plugin worker now only filters by garbage threshold.
|
||
|
|
d7865909ba |
feat(mount): proactive flush of idle writable chunks (#9094)
* feat(mount): proactive flush of idle writable chunks
Add a background goroutine that periodically scans writable chunks across all open file handles and seals those that are idle and unlikely to receive further writes, submitting them for async upload. A writable chunk is proactively flushed when it has been idle for 500ms AND meets one of: nearly full (>=90%), behind the sequential write frontier by 2+ chunks, or stale for 5+ seconds. Flushing only happens when the upload pipeline has spare capacity (< half of concurrent writer slots in use). This prevents partial chunks from accumulating until fsync/close, which is particularly beneficial for bursty or small-file workloads where chunks may never reach IsComplete(). Also fixes a latent bug in ActivityScore where MarkRead/MarkWrite used value receivers, silently discarding all mutations.
* refactor(mount): reuse WriterPattern instead of duplicating sequential detection
Remove the isSequential atomic from UploadPipeline. The proactive flusher now reads IsSequentialMode() from the existing WriterPattern on PageWriter and passes it as a parameter to ProactiveFlush. This avoids duplicating the sequential/random detection that WriterPattern already maintains.
* fix(mount): address PR review feedback
- Make ActivityScore thread-safe using atomics (CAS loop for score updates, atomic load/swap for timestamp). Previously MarkRead was called under RLock while MarkWrite held a write lock, creating a data race on the shared fields.
- Fix ProactiveFlush half-capacity guard: use multiplication (uploaderCount*2 >= max) instead of floor division (max/2) which misbehaves for odd or small concurrentWriterMax values.
* fix(mount): review fixes for proactive flush
- Fix TOCTOU race in lastWriteChunkIndex update: use CAS loop so concurrent writers cannot regress the frontier.
- Remove unused UploaderCount() getter.
- Reuse the caller-provided tsNs instead of calling time.Now() again in WriteDataAt for lastWriteTsNs, eliminating a redundant syscall per write.
|
||
|
|
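The value-receiver bug called out above is worth seeing concretely. A sketch of the broken shape and the atomic fix (illustrative, not the actual ActivityScore):

```go
package mount

import "sync/atomic"

// ActivityScore with atomic fields: MarkRead runs under RLock while MarkWrite
// holds the write lock, so plain fields would race.
type ActivityScore struct {
	score       atomic.Int64
	lastWriteNs atomic.Int64
}

// Broken original shape (value receiver): the method mutated a copy, so every
// update was silently discarded:
//
//	func (s ActivityScore) MarkWrite(tsNs int64) { ... }
//
// Fixed: pointer receiver, atomic store for the timestamp, CAS loop for the
// score (a plain Add would do for increments; the CAS form also covers
// non-monotonic updates such as decay).
func (s *ActivityScore) MarkWrite(tsNs int64) {
	s.lastWriteNs.Store(tsNs)
	for {
		old := s.score.Load()
		if s.score.CompareAndSwap(old, old+1) {
			return
		}
	}
}
```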
46b801aedb |
fix(admin): list all masters and dedupe EC file counts in dashboard (#9093)
* fix(admin): list all masters and dedupe EC file counts in dashboard
Dashboard -> Master Nodes only ever showed the currently connected master because getMasterNodesStatus hard-coded a single entry. Replace it with a RaftListClusterServers call that returns every master in the raft group and tags the real leader, falling back to the current master only if the raft call fails.
Buckets -> Object Store Buckets could render 0 objects for a bucket backed by an EC volume. Every shard holder reports the same whole-volume file_count (read from the replicated .ecx), so the first-seen value wins; if that first node had not yet finished loading .ecx it reported 0 and pinned the aggregate at 0. Take the max across reporting nodes instead.
The dashboard header total_files also dropped after volumes were converted to erasure coding because getTopologyViaGRPC never folded EC file_count into topology.TotalFiles. Aggregate it with the same max/sum dedupe.
* fix(admin): address PR review comments
- bound RaftListClusterServers with a 3s timeout so the dashboard endpoint cannot hang on a stalled master
- pre-validate raft addresses with net.SplitHostPort before calling pb.GrpcAddressToServerAddress, which otherwise glog.Fatalf's on a malformed entry and would crash the admin process
- when raft is unreachable, mark the fallback master as not-leader rather than claiming leadership the code cannot verify
- warn when summed EC delete_count exceeds file_count while folding into topology.TotalFiles, matching collectCollectionStats
* fix(admin): distinguish empty raft response from RPC failure
When RaftListClusterServers returns successfully with no servers, raft is not initialized (standalone/non-raft cluster), so the single fallback master is the leader. Only treat the fallback as a non-leader when the RPC actually failed.
* fix(admin): remove misleading Objects column from S3 buckets page
The bucket "Objects" column displayed needle counts from volume collection stats, not actual S3 object counts. This is confusing because a single S3 object can span multiple needles (multipart uploads, versions) and the count is inaccurate for EC volumes. Remove the ObjectCount field from S3Bucket, the Objects table column, the sort-by-objects handler, the detail-view row, and both CSV export references.
* fix(admin): correct cell indexes in fallback bucket CSV export
After the Objects column was removed, the fallback CSV exporter in admin.js still used stale cell indexes: cells[1] mapped to Owner (not Created), cells[2] to Created (not Size), cells[3] to Logical Size (not Quota). Align all indexes with the current table column order and include Owner, Logical Size, and Physical Size.
|
||
|
|
dfecd664f9 |
fix(master): do not re-enter warmup when a fresh cluster grows its first volume (#9092)
* fix(master): do not re-enter warmup when a fresh cluster grows its first volume
Follow-up investigation on #8777. After fixing the writable-chunk cap deadlock, a second issue surfaced on the same fio reproducer: the weed mount would log thousands of
upload data X: filerGrpcAddress assign volume: assign volume failure count:1 path:"/test2/X": assign volume: rpc error: code = Canceled desc = grpc: the client connection is closing
cascading from the filer's handler. The mount's own cached gRPC connection to the filer was never invalidated — the "client connection is closing" text is the FILER's cached connection to the MASTER going away, and the message is forwarded verbatim in the filer's AssignVolumeResponse.Error.
Root cause: Topology.IsWarmingUp checks the *live* GetMaxVolumeId. On a fresh cluster the master is initialized with SetLastLeaderChangeTime (master_server.go:239 "Seed the warmup timestamp so IsWarmingUp() is active even if the leader change event hasn't fired yet"), but the MaxVolumeId==0 guard is supposed to short-circuit IsWarmingUp for bootstraps so there is no wait. That guard breaks the moment the first volume is grown inside the warmup window: MaxVolumeId flips from 0 to 1, the lastLeaderChangeTime is still within 3*pulse (15 s default), and IsWarmingUp retroactively returns true for the next several seconds. Every AssignVolume in that window returns codes.Unavailable, which trips the filer's shouldInvalidateConnection guard and tears down its cached master connection, which in turn surfaces as "client connection is closing" to every concurrent in-flight call from the mount's file-close flush storm. fio reports EIO on close and the user sees a thousand scary error lines in the mount log for an otherwise correct run.
Fix: snapshot `hadVolumesAtLeaderChange` inside SetLastLeaderChangeTime and read that snapshot in IsWarmingUp instead of the live MaxVolumeId. A fresh cluster snapshots "no volumes at leader change" → IsWarmingUp stays false through the entire warmup window regardless of how fast the first grow lands. A real leader transition on a populated cluster still snapshots "has volumes" → IsWarmingUp behaves exactly as before until the 3*pulse window closes. The lock ordering in SetLastLeaderChangeTime reads MaxVolumeId before taking the lastLeaderChangeTimeLock so the two calls cannot interleave weirdly; IsWarmingUp reads both fields under a single RLock acquisition.
Verified with the same containerized reproducer used for the deadlock fix: 4 jobs × 250 nrfiles × 40 MiB × 4k randwrite direct.
- baseline (master): fio rc=1 (EIO on close), 2809 mount error lines matching "filerGrpcAddress assign volume ... client connection is closing", 788 filer lines matching "warming up", 331 "Removing cached gRPC connection to ...19333 due to error: master is warming up".
- patched: fio rc=0 in 1.9 s at 79.2 MiB/s, 0 mount errors, 0 filer "warming up" lines, 0 cached-master-conn invalidations.
go test ./weed/topology/... passes.
* fix(topology): make NodeImpl.maxVolumeId atomic
CodeRabbit flagged on #9092 that my new IsWarmingUp snapshot (hadVolumesAtLeaderChange) reads through GetMaxVolumeId(), which until now returned an unprotected int field on NodeImpl. UpAdjustMaxVolumeId is called from the volume server heartbeat path in parallel with GetMaxVolumeId reads on the assign path, and neither side had any synchronization — a long-standing data race the race detector would flag if the warmup test suite stressed both sides.
Switch maxVolumeId to atomic.Uint32 (needle.VolumeId is uint32) and implement UpAdjustMaxVolumeId as a CAS loop so the check-then-set stays linearizable: two heartbeats racing to promote the field will land the higher value deterministically and propagate to the parent exactly once. GetMaxVolumeId is a single atomic load. Callers of both helpers are unchanged; the struct field comment documents why the atomic is necessary. go test -race ./weed/topology/... passes.
* refactor(topology): use WarmupDuration helper in IsWarmingUp
Gemini review on #9092 flagged that IsWarmingUp re-derives the warmup duration from pulse and WarmupPulseMultiplier instead of using the existing WarmupDuration() helper, which RemainingWarmupDuration already uses. Fold the duration calculation through the helper and short-circuit on lastChange.IsZero() before the time.Since call. No behavior change.
|
||
|
|
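A sketch of the CAS promotion described in the second commit, on a cut-down stand-in for NodeImpl:

```go
package topology

import "sync/atomic"

type NodeImpl struct {
	maxVolumeId atomic.Uint32 // needle.VolumeId is uint32
}

// GetMaxVolumeId is a single atomic load, safe against concurrent heartbeats.
func (n *NodeImpl) GetMaxVolumeId() uint32 {
	return n.maxVolumeId.Load()
}

// UpAdjustMaxVolumeId raises maxVolumeId to vid if vid is higher; the CAS
// loop keeps the check-then-set linearizable, so two racing heartbeats land
// the higher value deterministically (and propagate upward exactly once).
func (n *NodeImpl) UpAdjustMaxVolumeId(vid uint32) (changed bool) {
	for {
		cur := n.maxVolumeId.Load()
		if vid <= cur {
			return false
		}
		if n.maxVolumeId.CompareAndSwap(cur, vid) {
			return true
		}
	}
}
```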
bdebd63f7c |
fix(mount): evict writable chunks when writeBufferSizeMB cap is reached (#9091)
* fix(mount): evict writable chunks when writeBufferSizeMB cap is reached
Reported on #8777: fio 4k randwrite with --nrfiles=1000 on a weed mount hangs indefinitely after ~1 second of activity, volume server QPS drops to zero, and the test never makes progress beyond a few hundred writes. A scaled-down local repro (4 jobs x 250 nrfiles x 40 MiB with -chunkSizeLimitMB=32 -writeBufferSizeMB=10240) reproduced the hang exactly: fio stuck in fio_state=SETTING_UP, mount idle, all writer goroutines parked in page_writer.(*WriteBufferAccountant).Reserve with n=0x2000000 (32 MiB).
Root cause: UploadPipeline.SaveDataAt reserves a full chunkSize slot against the global WriteBufferAccountant the first time it allocates a writable chunk for a given file. The reservation is released by SealedChunk.FreeReference only after the chunk's async upload completes, and chunks only move from writable to sealed when they fill or when the file is flushed/closed. For 4k random writes with small per-file data and iodepth holding files open, writable chunks never fill and never close, so their slots are pinned forever. Once openFiles * chunkSize exceeds writeBufferSizeMB, every new writer blocks in Reserve's cond.Wait with no possible waker, a hard deadlock. The user-visible symptoms (volume QPS=0, waited 30min, still running) are exactly that state. In the reported config, 8 jobs * 1000 nrfiles * 32 MiB = 256 GB of reservations against a 10 GB cap.
Fix: add a SetEvictor hook on WriteBufferAccountant. When Reserve would otherwise block on the cond, it single-flights an evictor callback that walks wfs.fhMap, picks the first open file handle with a writable chunk, and force-seals its fullest writable chunk via a new UploadPipeline.EvictOneWritableChunk. Force-sealing submits the chunk for async upload on the existing uploader pool; the upload's SealedChunk.FreeReference releases the global slot, broadcasts on the accountant cond, and unblocks the waiter. The fullest-first heuristic matches the existing over-limit path in SaveDataAt and keeps us from thrashing on repeatedly re-sealing half-empty chunks. The original #9066 motivation (cap global write-pipeline growth so swap can't be filled while uploads stall) is preserved: Reserve still blocks when the cap is actually exhausted by in-flight uploads, and a failed evictor (no writable chunks to seal) falls through to cond.Wait exactly as before.
Verified end-to-end in a privileged debian container running a full master + volume + filer + weed mount stack. Against the reporter's config shrunk to the same memory footprint (4 jobs x 250 nrfiles x 40 MiB, -chunkSizeLimitMB=32 -writeBufferSizeMB=10240):
- baseline binary (master): fio stalls after ~320 writes, killed by timeout at 400s with io=1280 KiB and bw=2383 B/s.
- patched binary: fio completes in 1.9 s at 82.8 MiB/s, all 40,000 writes issued, 156 MiB written, rc=0.
page_writer unit tests still pass. Touches: WriteBufferAccountant (SetEvictor + single-flight in Reserve), UploadPipeline.EvictOneWritableChunk, DirtyPages interface + the two existing implementers (ChunkedDirtyPages, PageWriter), and a new weedfs_write_buffer_evict.go holding the WFS.evictOneWritableChunk callback that walks fhMap.
* fix(mount): re-check write-budget cap after evict and make eviction panic-safe
Addresses two review concerns on #9091:
1. CodeRabbit flagged that the original Reserve loop only considered the break condition when the evictor reported success:
if evicted { if a.used+n <= a.cap || a.used == 0 { break } }
a.cond.Wait()
A concurrent Release firing during the evictor's execution window (we drop a.mu around the callback so uploader goroutines can drain) would land its Broadcast on an empty waiter set and then be missed, because we unconditionally call cond.Wait() on the same iteration. In practice an in-flight async upload eventually fires another broadcast so this would delay rather than permanently hang — but the logic is wrong either way and should re-check the cap after every evictor round.
2. Gemini flagged that setting `a.evicting = true` and dropping `a.mu` without a deferred unwind leaves the accountant in a broken state if the evictor panics: the flag stays set forever and no Reserve caller can ever run another eviction, even though the mutex gets unlocked by the normal stack unwind.
Pull the evictor call into a tiny helper runEvictorLocked whose defer re-acquires a.mu and clears a.evicting regardless of panic status, then broadcasts so other blocked Reservers re-evaluate. Reserve checks the cap condition immediately after the helper returns; only falls through to cond.Wait() when the budget is still exhausted. No behavior change on the happy path; the fix is in the error and race corners.
* doc(mount): clarify eviction-scan cost on the blocked-Reserve path
Gemini review on #9091 flagged that the fhMap scan inside evictOneWritableChunk is O(open files). Document why the cost is bounded to the Reserve-blocked path and paid at most once per chunkSize drained, rather than on the write hot path, and record the reason we do not preempt with an LRU.
|
||
|
|
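A sketch of the panic-safe helper, on a cut-down stand-in struct. The defer re-locks and clears the single-flight flag even when the evictor callback panics, then broadcasts so blocked Reservers re-evaluate:

```go
package page_writer

import "sync"

type accountant struct {
	mu       sync.Mutex
	cond     *sync.Cond // initialized as sync.NewCond(&a.mu)
	evicting bool
}

// runEvictorLocked is called with a.mu held; it releases the lock around the
// callback so uploader goroutines can Release in the meantime.
func (a *accountant) runEvictorLocked(evict func() bool) (evicted bool) {
	a.evicting = true
	a.mu.Unlock()
	defer func() {
		// Runs on panic too: re-acquire the lock, clear the flag, wake waiters.
		a.mu.Lock()
		a.evicting = false
		a.cond.Broadcast()
	}()
	return evict()
}
```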
2fd60cfbc3 |
fix(balance): guard against destination overshoot and oscillation (#9090)
* fix(balance): guard against destination overshoot and oscillation

Plugin-worker volume_balance detection re-selects maxServer/minServer each iteration based on utilization ratio. With heterogeneous MaxVolumeCount values, a single greedy move can flip which server is most-utilized, causing A->B, B->A oscillation within one detection cycle and pushing destinations past the cluster ideal.

Mirror the shell balancer's per-move guard (weed/shell/command_volume_balance.go:440): before scheduling a move, verify that the destination's post-move utilization would not strictly exceed the source's post-move utilization. If it would, no single move can improve balance, so stop.

Add regression tests that cover:

- TestDetection_HeterogeneousMax_NoOvershootNoOscillation: 2 servers with different caps just above threshold; detection must not oscillate or make the imbalance worse.
- TestDetection_RespectsClusterIdealUtilization: 3-server heterogeneous layout; destinations must not overshoot cluster ideal.

* fix(balance): use effective capacity when resolving destination disk

resolveBalanceDestination read VolumeCount directly from the topology snapshot, which is not updated when AddPendingTask registers a move within the current detection cycle. This meant multiple moves planned in a single cycle all saw the same static count and could target the same disk past its effective capacity.

Switch to ActiveTopology.GetNodeDisks + GetEffectiveAvailableCapacity so that destination planning accounts for all pending and assigned tasks affecting the disk — consistent with how the detection loop already tracks effectiveCounts at the server level.

Add a unit test that seeds two pending balance tasks against a destination disk with 2 free slots and asserts resolveBalanceDestination rejects a third planned move.

* fix(ec_balance): capacity-weighted guard in Phase 4 global rebalance

detectGlobalImbalance picked min/max nodes by raw shard count and compared them against a simple (unweighted) rack-wide average. With heterogeneous MaxVolumeCount across nodes in the same rack, this lets the greedy algorithm move shards from a large, barely-used node to a small, nearly-full node just because the small node has fewer shards in absolute terms — strictly worsening imbalance by utilization and potentially overfilling the small node.

Snapshot each node's total shard capacity (current shards plus free slots) at loop start and add a per-move convergence guard: reject any move where the destination's post-move utilization would strictly exceed the source's post-move utilization. Mirrors the fix in weed/worker/tasks/balance/detection.go.

Regression test TestDetectGlobalImbalance_HeterogeneousCapacity covers a rack with node1 (cap 100, 10 shards → 10% util) and node2 (cap 5, 3 shards → 60% util). Before the fix, Phase 4 moves 2 shards from node1 to node2, filling node2 to 100% util. After the fix, the guard blocks both moves.

* fix(ec_balance): utilization-based max/min in Phase 4 rebalance

Phase 4's global rebalancer picked source and destination nodes by raw shard count, and compared against a simple raw-count average. With heterogeneous MaxVolumeCount across nodes in a rack, this got the direction wrong: a large-capacity node holding many shards in absolute terms but only a small fraction of its capacity would be picked as the "overloaded" source, while a small-capacity node nearly at its slot limit (but holding fewer absolute shards) would be picked as the "underloaded" destination.

The previous fix added a strict-improvement guard that prevented the bad move but left balance untouched — the rack stayed in an uneven state. Switch to utilization-based selection and a utilization-based pre-check:

- Pick max/min by (count / capacity), where capacity is the node's current allowed shards plus remaining free slots (snapshotted once per rack and held constant for the duration of the loop).
- Replace the raw-count imbalance gate (exceedsImbalanceThreshold) with a new exceedsUtilImbalanceThreshold helper that compares fractional fullness. The raw-count gate is still used by Phase 2 and Phase 3, where the per-rack / per-volume semantics differ.
- Drop the raw-count guards (maxCount <= avgShards || minCount+1 > avgShards and maxCount-minCount <= 1) now that the per-move strict-improvement check handles termination correctly for both homogeneous and heterogeneous capacity.

Also fix a latent bug in the inner shard-selection loop: it was not updating shardBits between iterations, so every iteration picked the same lowest-set bit and emitted duplicate move requests for the same physical shard. Update maxNode and minNode's shardBits immediately after appending a move, mirroring what applyMovesToTopology does between phases.

Update TestDetectGlobalImbalance_HeterogeneousCapacity to assert:

- Moves flow from the higher-util node2 to the lower-util node1 (direction check), and
- Each (volumeID, shardID) pair appears at most once in the move list (duplicate-shard guard).

* fix(ec_balance): keep source freeSlots in sync after planned shard moves

All three phase loops that plan EC shard moves (detectCrossRackImbalance, detectWithinRackImbalance, detectGlobalImbalance) decrement the destination node's freeSlots but leave the source node's freeSlots stale. Over the course of a detection run that processes many volumes or iterates within a rack, the source's reported freeSlots drifts below its actual value.

In Phase 4 specifically, the per-move strict-improvement guard prevents the source from becoming a destination candidate, so the stale value never affects decisions. In Phases 2 and 3 it can: a node that sheds shards for one volume's rebalance is eligible as a destination for another volume in the same run, and the destination selection uses node.freeSlots <= 0 as a hard skip (findDestNodeInUnderloadedRack / findLeastLoadedNodeInRack). A tightly-provisioned node could be skipped as a destination even after it has freed slots.

Increment maxNode.freeSlots / node.freeSlots symmetrically at each scheduled move so freeSlots remains an accurate running view of available slot capacity throughout a detection run. |
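The convergence rule shared by these fixes fits in a few lines. Below is a minimal sketch, not the actual SeaweedFS code: node, util, and moveImprovesBalance are hypothetical names, and capacity stands for the snapshotted count-plus-free-slots value the commits describe.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a server or disk in the detection
// loop: count is its current shard/volume count and capacity is the
// count-plus-free-slots value snapshotted at loop start.
type node struct {
	name     string
	count    int
	capacity int
}

// util returns the node's utilization after applying delta moves.
func (n node) util(delta int) float64 {
	return float64(n.count+delta) / float64(n.capacity)
}

// moveImprovesBalance is the per-move convergence guard: reject any move
// where the destination's post-move utilization would strictly exceed the
// source's post-move utilization. If no candidate move passes, no single
// move can improve balance, so the loop stops instead of oscillating.
func moveImprovesBalance(src, dst node) bool {
	return dst.util(+1) <= src.util(-1)
}

func main() {
	big := node{"node1", 10, 100} // 10% utilized, but many shards in absolute terms
	small := node{"node2", 3, 5}  // 60% utilized, but few shards in absolute terms

	// Raw-count selection would move big -> small; the guard blocks it:
	// small would land at 4/5 = 80%, above big's post-move 9/100 = 9%.
	fmt.Println(moveImprovesBalance(big, small)) // false: move is blocked

	// The utilization-correct direction passes: big lands at 11/100 = 11%,
	// below small's post-move 2/5 = 40%.
	fmt.Println(moveImprovesBalance(small, big)) // true
}
```

Note the guard is non-strict ("would not strictly exceed" in the commit wording), so a move that leaves source and destination equally utilized still passes.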
||
|
|
979c54f693 |
fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial host:httpPort. Raw string comparison against VolumeLocation.Leader / HeartbeatResponse.Leader therefore never matches, causing the masterclient and the volume server heartbeat loop to continuously "redirect" to the already-connected master, tearing down the stream and reconnecting. Use ServerAddress.Equals, which normalizes the grpc-port suffix.

* fix(filer,mq): compare ServerAddress via Equals in two more sites

The filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local partition assignment check both compared a wire-supplied address string against the local self ServerAddress with raw string equality. Both are vulnerable to the same plain-vs-host:port.grpcPort mismatch as the masterclient/volume heartbeat sites: the filer would bootstrap from itself, and the broker would fail to claim a partition it was actually assigned. Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would skip the child-lock check on the real leader when the two carry different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the Leader() == Name() guard for ensureTopologyId could disagree with topology.IsLeader() (which already uses Equals), so leader-only initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared user-supplied host:port against NewServerAddressFromDataNode(...) with ==, letting an existing replica slip through when topology carries the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the partition ownership check gated listing on raw LeaderBroker == BrokerAddress().String(), so listings silently omitted partitions hosted locally when the assignment carried the other host:port / host:port.grpcPort form.
- lock_client.go: the LockHostMovedTo comparison and the seedFiler fallback guard both used raw string equality against configured filer addresses (which may be plain host:port while LockHostMovedTo comes back suffixed), causing spurious host-change churn and blocking the seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: the direct activeBrokers.Get() lookup missed brokers when a persisted assignment carried a different address encoding than the registered broker key, triggering a bogus reassignment on every read/write cycle. Added a findActiveBroker helper that falls back to an Equals-based scan and canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between LockOwner() and BrokerAddress().String(), so a lock owner could fail to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only on exact LeaderBroker match, so an encoding flip in lookup results tore down and recreated a stream to the same broker. |
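All of these sites reduce to one normalization rule, which is small enough to sketch. The following is illustrative only, assuming the host:httpPort.grpcPort encoding the first commit describes; normalize and equalAddr are hypothetical stand-ins for pb.ServerAddress.Equals, not its actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize reduces "host:httpPort.grpcPort" to "host:httpPort", assuming
// the gRPC port, when present, is appended after a '.' following the HTTP
// port, as the commit message describes. (A real implementation would also
// have to handle IPv6 literals and default ports.)
func normalize(addr string) string {
	if colon := strings.LastIndex(addr, ":"); colon >= 0 {
		if dot := strings.Index(addr[colon:], "."); dot >= 0 {
			return addr[:colon+dot]
		}
	}
	return addr
}

// equalAddr compares two addresses by their normalized forms, so the plain
// and grpc-suffixed encodings of the same server compare equal.
func equalAddr(a, b string) bool {
	return normalize(a) == normalize(b)
}

func main() {
	leader := "master1:9333.19333" // form advertised by the raft leader
	dialed := "master1:9333"       // form the client actually dialed

	fmt.Println(leader == dialed)          // false: raw comparison never matches
	fmt.Println(equalAddr(leader, dialed)) // true: suffix-normalized comparison
}
```

The raw comparison on the last two lines is exactly the failure mode each bullet above exhibits; routing every site through one Equals-style helper makes the encoding difference invisible to callers.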