Commit Graph

13548 Commits

Author SHA1 Message Date
Chris Lu
d2e64e85ce fix(log_buffer): back off disk-poll cadence when caught up to disk head (#9161)
Idle subscribers parked in the ResumeFromDiskError path were re-probing
the in-memory buffer and disk every 250ms, emitting a "Notification
timeout after ResumeFromDiskError, rechecking state" log line per tick
even when nothing had changed. Once ReadFromDiskFn returns without
advancing lastReadPosition, we know the disk has no data past the
subscriber's current position. Switch to a 2s poll in that state so
external disk writers (e.g. Schema Registry) are still re-detected on a
bounded cadence, but idle CPU and log noise drop ~8x. Any progress
(disk advance or notifyChan wakeup) clears the flag and restores the
responsive 250ms tick for active readers. The per-timeout V(4) log is
replaced by a single "Caught up to disk head" transition log.
2026-04-20 14:36:56 -07:00
dependabot[bot]
61c1735cdd build(deps): bump modernc.org/sqlite from 1.46.1 to 1.49.1 (#9155)
Bumps [modernc.org/sqlite](https://gitlab.com/cznic/sqlite) from 1.46.1 to 1.49.1.
- [Changelog](https://gitlab.com/cznic/sqlite/blob/master/CHANGELOG.md)
- [Commits](https://gitlab.com/cznic/sqlite/compare/v1.46.1...v1.49.1)

---
updated-dependencies:
- dependency-name: modernc.org/sqlite
  dependency-version: 1.49.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:20:55 -07:00
dependabot[bot]
49ce2b7326 build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 (#9156)
Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5.
- [Release notes](https://github.com/rclone/rclone/releases)
- [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md)
- [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5)

---
updated-dependencies:
- dependency-name: github.com/rclone/rclone
  dependency-version: 1.73.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:20:46 -07:00
dependabot[bot]
a890300eaf build(deps): bump cloud.google.com/go/pubsub from 1.50.1 to 1.50.2 (#9154)
Bumps [cloud.google.com/go/pubsub](https://github.com/googleapis/google-cloud-go) from 1.50.1 to 1.50.2.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-cloud-go/compare/pubsub/v1.50.1...pubsub/v1.50.2)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/pubsub
  dependency-version: 1.50.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:12:47 -07:00
dependabot[bot]
6642a64d2b build(deps): bump github.com/go-git/go-billy/v5 from 5.6.2 to 5.8.0 (#9152)
Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.6.2 to 5.8.0.
- [Release notes](https://github.com/go-git/go-billy/releases)
- [Commits](https://github.com/go-git/go-billy/compare/v5.6.2...v5.8.0)

---
updated-dependencies:
- dependency-name: github.com/go-git/go-billy/v5
  dependency-version: 5.8.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:06:33 -07:00
Lars Lehtonen
93d604d799 chore(weed/s3api/policy): prune unused test functions (#9150) 2026-04-20 12:04:41 -07:00
dependabot[bot]
dec09d1484 build(deps): bump github.com/aws/aws-sdk-go-v2 from 1.41.5 to 1.41.6 (#9153)
Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.41.5 to 1.41.6.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/v1.41.5...v1.41.6)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2
  dependency-version: 1.41.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:03:28 -07:00
dependabot[bot]
25d7f2c569 build(deps): bump docker/build-push-action from 6 to 7 (#9151)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](https://github.com/docker/build-push-action/compare/v6...v7)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-20 12:02:50 -07:00
Chris Lu
86c5e815d2 fix(kafka): make consumer-group rebalancing work end-to-end (#9143)
* fix(kafka): make consumer-group rebalancing work end-to-end

TestConsumerGroups was failing every run since the job was added
(2026-04-17) but the failures were masked by a `|| echo ...` trailer on
the go test invocation, so the CI reported green. Removing the mask
exposes several real bugs in the gateway's group-coordinator code:

1. JoinGroup deduplicated members by ClientID, which collapsed two
   Sarama consumers that share the default ClientID ("sarama") into a
   single member slot and broke rebalancing. Key dedup off the TCP
   ConnectionID instead; keep ClientID on the member for DescribeGroup
   fidelity.

2. Every JoinGroup replaced the *GroupMember struct, wiping the
   Assignment the leader had just published in its SyncGroup and leaving
   non-leader consumers with 0 partitions after a rebalance. Update the
   existing member in place on rejoin.

3. Non-leader SyncGroup returned an empty assignment while the leader
   was mid-rebalance, so consumers silently came up with no partitions.
   Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama
   retries the join/sync cycle (4 retries x 2s backoff by default).

4. Heartbeat returned ILLEGAL_GENERATION on a gen mismatch even when
   the group was in PreparingRebalance/CompletingRebalance. Return
   REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly
   cancels the session instead of tearing it down on a fatal error.

5. LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3
   (Members array) by default, so the gateway silently rejected the
   request as InvalidGroupID and dead consumers stayed in the group as
   phantom leaders. Added v3 (Members array) and v4+ (flexible/compact/
   tagged-fields) parsing.

The rebalancing integration tests called Consume() once per consumer,
which cannot survive a rebalance (heartbeat RBIP cancels the session
and Consume() returns - this is documented Sarama behaviour; callers
are expected to loop). Added a runConsumeLoop helper and used it in the
four affected sub-tests. RebalanceTestHandler.Setup now overwrites
stale entries in its assignments channel so the test observes the
settled post-rebalance snapshot rather than whatever arrived first.

* fix(kafka): address PR review feedback

- JoinGroup now snapshots existing members before mutating and restores
  the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the
  rollback path always deleted the entry, corrupting group state when
  an existing member rejoined with an incompatible protocol.

- handleLeaveGroup iterates request.Members instead of processing only
  the first entry, so v3+ batch departures (KIP-345 style) correctly
  remove every listed member and build a per-member response. A single
  group-state transition runs after the loop, with leader election
  only triggered if the actual group leader was among the departures.

- Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser
  already decoded flexible versions, but the response still went out in
  non-flexible encoding (4-byte array lengths, 2-byte strings, no
  tagged fields), which v4+ clients could not parse. Route flexible
  versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse.

- BasicFunctionality gives each consumer its own
  ConsumerGroupHandler/ready channel. The previous shared handler
  closed ready once, so readyCount advanced to numConsumers from a
  single signal; the test could proceed without the other consumers
  actually reaching Setup.

- RebalanceTestHandler.assignments is now a size-1 channel, so readers
  always observe the most recent rebalance snapshot instead of an
  intermediate one from an earlier round.
2026-04-20 10:11:45 -07:00
Ping Qiu
f4ce2be875 doc: P14 S8 final bounded close — evidence matrix + P15 handoff (#9142)
* doc: P14 S8 final bounded close — evidence matrix + P15 handoff

Adds the six S8 closure deliverables consolidating S4-S7 evidence,
classifying V2 scenarios, and mapping residual product gaps onto
canonical P15 tracks (per v3-phase-15-product-plan.md §4).

New docs:
- v3-phase-14-s8-assignment.md — S8 execution contract.
- v3-phase-14-s8-final-bounded-close.md — bounded P14 target,
  accepted topology, reject conditions.
- v3-phase-14-s8-evidence-matrix.md — 16 claims × {L0, L1, L2, L3,
  Status, Residual}. 15 PROVEN, 1 PARTIAL (Claim 15 fence
  quantitative bound, P14 internal follow-up). Rounds 2-3 architect
  corrections: Claim 10 / 12 L2 narrowed; Claim 6 refresh gap closed
  by the new L1 test (see companion commit in seaweed_block).
- v3-phase-14-s8-v2-scenario-classification.md — every V2 scenario
  mapped to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS /
  BLOCKED-HA / BLOCKED-PERF / PORT-MECHANISM; scenario YAMLs kept
  as L3 shape, not executed evidence.
- v3-phase-14-s8-p15-handoff.md — 11 rows (10 canonical P15 tracks
  + 1 P14 internal follow-up anchored to Claim 15 PARTIAL); §4
  integrity check split by row class.
- v3-phase-14-s8-closure.md — final P14 closure statement matching
  the close doc §10 wording; explicit non-goals; all 9 P15 tracks
  named with canonical numbering.

No claim of CSI / frontend / migration / security / performance /
production readiness. Every product gap is handed off with a
concrete first-proof gate.

Companion: seaweed_block commit adds the IntentRefreshEndpoint L1
route test that closes Claim 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* doc: P14 S8 — resolve port-now doc conflict (CodeRabbit #4)

final-bounded-close.md §7 previously said "Port now: testrunner,
scenarios, component harness, qa_block, learn/test" while
v2-scenario-classification.md §2 says S8 does NOT port testrunner
machinery and defers all actual porting to P15.

Align final-bounded-close.md §7 with classification: section
renamed "Classify now (S8 scope), port deferred to P15". Every
item now states which P15 track actually owns the port (Final Gate
or T1 Frontend + Data Path as applicable).

No scope expansion; no new handoff gap. Pure doc-consistency fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:24:44 -07:00
Chris Lu
f1d5f31a93 fix(mount): retry saveEntry on transient filer errors; stop mismapping Canceled to EIO (#9141)
* fix(mount): retry saveEntry on transient filer errors, stop mismapping Canceled to EIO

When the mount's gRPC connection to the filer flaps (e.g. a transient
restart or network blip), every in-flight setattr/utimes/chmod/xattr/
rename-driven saveEntry returns "code = Canceled desc = grpc: the client
connection is closing" at the same instant. Two bugs in saveEntry then
turned each of those into a hard EIO for the user:

1. The error was wrapped with fmt.Errorf(... %v ...) before being passed
   to grpcErrorToFuseStatus. %v stringifies the status, so
   status.FromError could no longer unwrap the gRPC code and the
   Canceled→ETIMEDOUT branch in the classifier never fired; every
   Canceled error fell through to the default EIO.

2. saveEntry issued a single streamUpdateEntry call with no retry,
   unlike doFlush which already wraps its CreateEntry in
   retryMetadataFlush. One stream flap therefore propagated straight to
   the FUSE caller instead of being ridden out across the 4-attempt /
   ~7s backoff window.

Wrap the UpdateEntry call in retryMetadataFlush (matching doFlush and
completeAsyncFlush) and switch the wrap verb to %w so the classifier
can still see the gRPC code. This recovers transient closes silently
and, if retries are exhausted, returns ETIMEDOUT instead of EIO.

Reported by rclone users in #9139 where a large concurrent copy
(hundreds of .partial uploads per filer flap) surfaced as walls of EIOs
because each .partial rename's post-setattr hit saveEntry at the worst
possible moment.

* mount: skip saveEntry retries on permanent filer errors

Address gemini-code-assist review on #9141: blindly retrying every
UpdateEntry failure with exponential backoff means interactive FUSE ops
like chmod/utimes/xattr can hang for ~7s before surfacing clearly
permanent errors (NotFound, PermissionDenied, InvalidArgument, etc.).

Introduce retryMetadataFlushIf, a variant of retryMetadataFlush that
accepts a shouldRetry predicate, and an isRetryableFilerError classifier
that short-circuits on a conservative whitelist of terminal gRPC codes.
Transient errors (Canceled / Unavailable / DeadlineExceeded /
ResourceExhausted / Internal) and non-gRPC errors still retry, so the
original fix for #9139 (rclone EIO burst during filer connection
flaps) is preserved.
2026-04-20 00:31:37 -07:00
Chris Lu
8857dbfb74 fix(test): drop host port mapping from risingwave catalog test to kill TOCTOU flake
The random host port allocated by MustFreeMiniPorts was released before
docker run bound it, occasionally losing the race to another process and
failing with "address already in use". The sidecar already reaches
RisingWave via shared netns (--network container:...), so the host -p
mapping and the corresponding WaitForPort check were unused.
2026-04-19 23:26:45 -07:00
Chris Lu
9a6b566fb1 fix(shell): volume.fsck keeps going past a single broken chunk manifest (#9140)
* fix(shell): volume.fsck no longer aborts on a single broken chunk manifest

Previously a single entry whose chunk-manifest could not be read (e.g. the
manifest needle was missing or its sub-chunks pointed at a now-gone volume)
caused collectFilerFileIdAndPaths to return immediately with
"failed to ResolveChunkManifest". The whole fsck run failed, so an operator
with even one corrupted file could not use volume.fsck to find or clean up
unrelated orphan needles on other volumes — they had to locate and delete
the bad entries first, blind, with no help from fsck.

Log the resolution failure with the entry path, fall back to recording the
top-level chunk fids the entry references (data fids and manifest fids
themselves; sub-chunks behind the unresolvable manifest stay unknown), and
keep traversing. Track the count of unresolved entries on the command struct
and refuse -reallyDeleteFromVolume for the run when the count is non-zero,
since the in-use fid set is incomplete and a purge could otherwise delete
live sub-chunks behind the broken manifest. Read-only fsck still produces a
useful (if conservatively over-reported) orphan listing so the operator can
see and fix the broken entries first, then re-run with apply.

Discovered while diagnosing #9116.

* address review: use callback ctx and atomic counter

- Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error
  cancellation propagates into the manifest fetch instead of using
  context.Background().
- TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go),
  so the unresolvedManifestEntries field on commandVolumeFsck is shared across
  workers and was racing. Switch it to atomic.Int64 with Add/Load.

* address review: reset counter per Do(), pass through ctx errors

- commandVolumeFsck is a singleton registered in init() and reused across
  shell invocations. Without resetting the unresolved-manifest counter at
  the top of Do(), a single failed run permanently suppressed
  -reallyDeleteFromVolume in the same shell session. Reset to 0 right
  after flag parsing.
- Treating context cancellation as manifest corruption was wrong: a
  Ctrl+C or deadline mid-traversal would inflate the counter and emit
  misleading "manifest broken" warnings for entries that were never
  examined. Detect context.Canceled / context.DeadlineExceeded and
  return the error so the BFS unwinds cleanly.

Not changing the findMissingChunksInFiler branch's purgeAbsent /
applyPurging gating: that path checks recorded filer fids against
volume idx files, and a broken-manifest entry's recorded manifest fid
will fail the existence check and get purged — which is the cleanup
the operator wants for those entries. Adding a gate would block the
exact use case the warning points them at.
2026-04-19 23:06:28 -07:00
Chris Lu
3ff92f797d 4.21 4.21 2026-04-19 14:38:29 -07:00
Chris Lu
a8ba9d106e peer chunk sharing 7/8: tryPeerRead read-path hook (#9136)
* mount: batched announcer + pooled peer conns for mount-to-mount RPCs

* peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that
  groups fids by HRW owner, fans out one ChunkAnnounce per owner in
  parallel. announcedAt is pruned at 2× TTL so it stays bounded.

* peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer
  address; the announcer and (next PR) the fetcher share it so
  steady-state owner RPCs skip the handshake cost entirely. Bounded
  at 4096 cached entries; shutdown conns are transparently replaced.

* WFS starts both alongside the gRPC server; stops them on unmount.

* mount: wire tryPeerRead via FetchChunk streaming gRPC

Replaces the HTTP GET byte-transfer path with a gRPC server-stream
FetchChunk call. Same fall-through semantics: any failure drops
through to entryChunkGroup.ReadDataAt, so reads never slow below
status quo.

* peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk
  (flattening manifests), asks the HRW owner for holders via
  ChunkLookup, then opens FetchChunk on each holder in LRU order
  (PR #5) until one succeeds. Assembled bytes are verified against
  FileChunk.ETag end-to-end — the peer is still treated as
  untrusted. Reuses the shared PeerConnPool from PR #6 for all
  outbound gRPC.

* peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing
  itself on a self-owned fid.

* filehandle_read.go: tryPeerRead slot between tryRDMARead and
  entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the
  presence of peerGrpcServer (the single identity test).

Read ordering with the feature enabled is now:
   local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server

One port, one identity, one connection pool — no more HTTP bytecast.

* test(fuse_p2p): end-to-end CI test for peer chunk sharing

Adds a FUSE-backed integration test that proves mount B can satisfy a
read from mount A's chunk cache instead of the volume tier.

Layout (modelled on test/fuse_dlm):

  test/fuse_p2p/framework_test.go        — cluster harness (1 master,
                                           1 volume, 1 filer, N mounts,
                                           all with -peer.enable)
  test/fuse_p2p/peer_chunk_sharing_test.go
                                         — writer-reader scenario

The test (TestPeerChunkSharing_ReadersPullFromPeerCache):

  1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner
     of a chunk is self ~50 % of the time (peer path short-circuits);
     with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly
     exercises the remote-owner fan-out.
  2. Mount 0 writes a ~8 MiB file, then reads it back through its own
     FUSE to warm its chunk cache.
  3. Waits for seed convergence (one full MountList refresh) plus an
     announcer flush cycle, so chunk-holder entries have reached each
     HRW owner.
  4. Mount 1 reads the same file.
  5. Verifies byte-for-byte equality AND greps mount 1's log for
     "peer read successful" — content matching alone is not proof
     (the volume fallback would also succeed), so the log marker is
     what distinguishes p2p from fallback.

Workflow .github/workflows/fuse-p2p-integration.yml triggers on any
change to mount/filer peer code, the p2p protos, or the test itself.
Failure artifacts (server + mount logs) are uploaded for 3 days.

Mounts run with -v=4 so the tryPeerRead success/failure glog messages
land in the log file the test greps.
2026-04-19 00:53:12 -07:00
Chris Lu
73f10fa528 peer chunk sharing 6/8: announce queue + batched flush (#9135)
mount: batched announcer + pooled peer conns for mount-to-mount RPCs

* peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that
  groups fids by HRW owner, fans out one ChunkAnnounce per owner in
  parallel. announcedAt is pruned at 2× TTL so it stays bounded.

* peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer
  address; the announcer and (next PR) the fetcher share it so
  steady-state owner RPCs skip the handshake cost entirely. Bounded
  at 4096 cached entries; shutdown conns are transparently replaced.

* WFS starts both alongside the gRPC server; stops them on unmount.
2026-04-18 21:42:36 -07:00
Chris Lu
fe9ca35bbd peer chunk sharing 5/8: mount chunk-directory shard (#9134)
mount: tier-2 chunk directory + FetchChunk streaming on one gRPC port

Collapses the old two-port design (HTTP peer-serve + separate gRPC
directory) into a single gRPC service that handles every mount-to-
mount exchange: ChunkAnnounce, ChunkLookup, and the new FetchChunk
byte stream.

* peer_directory.go: fid -> holders shard, HRW-gated; returns holders
  in LRU order; capacity-bounded; Sweep handles eviction under
  write-lock while Lookup runs under RLock (hot path is concurrent).

* peer_grpc.go: single MountPeer gRPC server implementing all three
  RPCs. FetchChunk frames bytes at 1 MiB per Send so the default
  4 MiB message cap does not constrain chunk size; cache miss
  returns gRPC NOT_FOUND so clients distinguish miss from transport
  error. Reuses pb.NewGrpcServer for consistent keepalive + msg-size
  tuning.

* peer_bytepool.go: sync.Pool wrapper around *[]byte that the server
  uses to avoid a fresh 8 MiB allocation per FetchChunk call.

* WFS wiring starts the gRPC server on option.PeerListen (the single
  peer port) using the advertise address resolved in PR #3 as the
  HRW identity. A background sweeper evicts expired directory
  entries every 60 s.
2026-04-18 20:18:38 -07:00
Chris Lu
8a6348d3e9 peer chunk sharing 4/8: mount registrar + HRW owner selection (#9133)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other.

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).

* mount: add peer chunk sharing options + advertise address resolver

First cut at the peer chunk sharing wiring on the mount side. No
functional behavior yet — this PR just introduces the option fields,
the -peer.* flags, and the helper that resolves a reachable
host:port from them. The server implementation arrives in PR #5
(gRPC service) and the fetcher in PR #7.

* ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we
  use -peer.listen's bind host if specific; else util.DetectedHostAddress
  combined with the port. This is what gets registered with the filer
  and announced to peers, so wildcard binds no longer result in
  unreachable identities like "[::]:18080".

* Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack.
  One port handles both directory RPCs and streaming chunk fetches
  (see PR #1 FetchChunk proto), so there is no second -peer.grpc.*
  flag — the old HTTP byte-transfer path is gone.

* New flags on weed mount: -peer.enable, -peer.listen (default :18080),
  -peer.advertise (default auto), -peer.rack.

* mount: register with filer and maintain HRW seed view

Adds the mount-side tier-1 client. On startup the mount calls
MountRegister with its advertise address (PR #3) and keeps both the
filer entry and the local seed view fresh via background tickers
(30 s register / 30 s list, 90 s filer TTL).

* peer_hrw.go: pure rendezvous-hashing helper picking a single owner
  per fid via top-1 HRW. Adding or removing one seed moves only
  ~1/N fids.

* peer_registrar.go: heartbeat + list poller. Seeds() returns the
  slice directly (no per-call copy) since listOnce atomically swaps;
  background RPCs bind their context to Stop() so unmount doesn't
  hang on a slow filer.

* WFS wiring uses ResolvePeerAdvertiseAddr from PR #3 for the
  identity registered with the filer. No HTTP server, no second
  port — one reachable address represents the mount.

* mount: broadcast MountRegister/MountList to every filer

Previously the registrar called through wfs.WithFilerClient, which only
reaches whichever filer the WFS filer-client session happens to be on.
That meant two mounts pointing at different filers would never see each
other: the filer mount registries are in-memory and per-filer (no
filer-to-filer sync), so each mount's MountList only returned peers
that had also registered through the same filer.

This commit makes the registrar multi-filer aware:

  * NewPeerRegistrar now takes the full FilerAddresses slice and a
    per-filer dial function. The old single-filer peerFilerClient
    interface is gone.

  * registerOnce fans a MountRegister RPC out to every filer in
    parallel. Succeeds if at least one filer accepted — an unreachable
    filer is tolerated, logged, and retried on the next heartbeat.

  * listOnce polls every filer's MountList in parallel and merges the
    responses by peer_addr, keeping the newest LastSeenNs on duplicates.
    Mounts talking to different filers therefore converge once every
    filer has been polled once.

The merged-list property is what lets a fleet of mounts spread across
multiple filers still form a single HRW seed view. Each filer only ever
sees the subset of mounts that heartbeat through it, but the registrar
reconstructs the union client-side.

New unit tests guard both properties:
  - RegisterBroadcastsToAllFilers: one registerOnce hits all N filers.
  - ListMergesAcrossFilers: mount-a on filer-1 and mount-b on filer-2
    both appear in the merged seed set.
  - ListMergeKeepsNewestLastSeen: the same mount reported by two
    filers collapses to one entry with the freshest timestamp.
2026-04-18 20:03:45 -07:00
Chris Lu
af1e571297 peer chunk sharing 3/8: mount peer-serve HTTP endpoint (#9132)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other.

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).

* mount: add peer chunk sharing options + advertise address resolver

First cut at the peer chunk sharing wiring on the mount side. No
functional behavior yet — this PR just introduces the option fields,
the -peer.* flags, and the helper that resolves a reachable
host:port from them. The server implementation arrives in PR #5
(gRPC service) and the fetcher in PR #7.

* ResolvePeerAdvertiseAddr: an explicit -peer.advertise wins; else we
  use -peer.listen's bind host if specific; else util.DetectedHostAddress
  combined with the port. This is what gets registered with the filer
  and announced to peers, so wildcard binds no longer result in
  unreachable identities like "[::]:18080".

* Option fields: PeerEnabled, PeerListen, PeerAdvertise, PeerRack.
  One port handles both directory RPCs and streaming chunk fetches
  (see PR #1 FetchChunk proto), so there is no second -peer.grpc.*
  flag — the old HTTP byte-transfer path is gone.

* New flags on weed mount: -peer.enable, -peer.listen (default :18080),
  -peer.advertise (default auto), -peer.rack.
2026-04-18 20:03:34 -07:00
Chris Lu
e24a443b17 peer chunk sharing 2/8: filer mount registry (#9131)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other.

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).
2026-04-18 20:03:23 -07:00
Chris Lu
d7d834b8f9 peer chunk sharing 1/8: proto definitions (#9130)
proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.
2026-04-18 20:02:55 -07:00
Chris Lu
6787a4b4e8 fix kafka gateway and consumer group e2e flakes (#9129)
* fix(test): reduce kafka gateway and consumer group flakes

* fix(kafka): make broker health-check backoff respect context

Replace time.Sleep in the retry loop with a select on bc.ctx.Done() and
time.After so the backoff is interruptible during shutdown, per review
feedback on PR #9129.

* fix(kafka): guard broker HealthCheck against nil client

Return the same "broker client not connected" error used by the other
exported BrokerClient methods instead of panicking on a partially
initialized client, per CodeRabbit review feedback on PR #9129.
2026-04-18 10:03:24 -07:00
Chris Lu
6832b9945b ci(s3tests): install libxml2/libxslt dev headers before pip install
ceph/s3-tests pins lxml without an upper bound. When pip picks a release
whose prebuilt wheel isn't published for Python 3.9 on the runner, it
falls back to sdist and fails without libxml2-dev / libxslt1-dev.
2026-04-18 01:06:39 -07:00
Chris Lu
1c130c2d47 fix(mount): close inodeLocks cleanup race that let two flock holders coexist (#9128)
* fix(mount): close inodeLocks cleanup race that allowed two flock holders

PosixLockTable.getOrCreateInodeLocks released plt.mu before the caller
acquired il.mu. A concurrent maybeCleanupInode could delete the map
entry in that window; the first caller would then insert its lock into
the orphaned inodeLocks while a later caller created a fresh entry in
the map, so findConflict never observed the orphaned lock and two
owners could simultaneously believe they held the same exclusive flock.

This matches the flaky CI failure seen in
TestPosixFileLocking/ConcurrentLockContention:

    Error: Should be empty, but was [worker N: flock overlap detected with 2 holders]

Mark removed inodeLocks as dead under plt.mu+il.mu, and have SetLk /
SetLkw recheck the flag after locking il.mu, refetching the live entry
from the map when orphaned. Also delete the map entry only if it still
points to this il, so a racing recreate is not clobbered.

Adds TestConcurrentFlockChurnPreservesMutualExclusion: 16 goroutines x
500 flock/unflock iterations on one inode. Reliably reports 500+
overlaps per run before the fix; clean across 100 race-enabled runs
after.

* fix(mount): extend dead-flag contract to GetLk and self-heal primitives

Address review feedback on the initial cleanup-race fix:

1. GetLk had the same stale-pointer bug as SetLk. A caller could grab
   an inodeLocks pointer, have cleanup orphan it and a replacement il
   receive a conflicting lock, then answer F_UNLCK off the empty dead
   pointer. Add the same dead recheck + refetch loop.

2. getOrCreateInodeLocks and getInodeLocks now treat a dead map entry
   as defective: the former replaces it with a fresh inodeLocks, the
   latter drops it and returns nil. Production cannot reach that state
   (maybeCleanupInode atomically deletes under plt.mu when it sets
   dead), but the hardening guarantees the SetLk / SetLkw / GetLk
   retry loops always make progress even if a future refactor reorders
   those operations, and it lets the white-box tests set up a stale
   dead entry without spinning.

3. Strengthen the regression suite:
   - TestSetLkRetriesPastDeadInodeLocks: deterministic white-box test
     that installs a dead il in the map and asserts SetLk routes the
     new lock into a fresh il (not the orphan), that GetLk reports the
     resulting conflict, and that a different-owner acquire is rejected
     with EAGAIN.
   - TestGetInodeLocksEvictsDeadEntry: verifies both map-read primitives
     drop or replace dead entries.
   - TestConcurrentFlockChurnPreservesMutualExclusion: replace the
     timing-fragile Add(1)-and-check counter with a Swap+CAS detector.
     Each worker claims a slot after SetLk OK and releases it before
     UN, flagging both an observed predecessor and a lost CAS on
     release. Against a reverted fix the detector fires 1000+ times per
     run; with the fix clean across 100 race-enabled iterations.

* test(mount): fail fast on unexpected SetLk statuses in churn loop

The stress test blindly spun on any non-OK SetLk status and discarded
the unlock return. If SetLk ever returns something other than OK or
EAGAIN (e.g. after a future refactor introduces a new error), the
acquire loop would spin forever and an unlock failure would be
silently swallowed.

Capture the acquire status, retry only on the expected EAGAIN, and
assert unlock returns OK. Use t.Errorf + return (not t.Fatalf) because
the checks run on worker goroutines where FailNow is unsafe. The
Swap+CAS overlap detector is unchanged.
2026-04-17 23:12:26 -07:00
Chris Lu
c1ccbe97dd feat(filer.backup): -initialSnapshot seeds destination from live tree (#9126)
* feat(filer.backup): -initialSnapshot seeds destination from live tree

Replaying the metadata event log on a fresh sync only leaves files that
still exist on the source at replay time: any entry that was created and
later deleted is replayed as a create/delete pair and never materializes
on the destination. Users who wipe the destination and re-run
filer.backup therefore see "only new files" instead of a full backup,
even when -timeAgo=876000h is passed and the subscription genuinely
starts from epoch (ref discussion #8672).

Add a -initialSnapshot opt-in flag: when set on a fresh sync (no prior
checkpoint, -timeAgo unset), walk the live filer tree under -filerPath
via TraverseBfs and seed the destination through sink.CreateEntry, then
persist the walk-start timestamp as the checkpoint and subscribe from
there. Capturing the timestamp before the walk lets the subscription
catch any create/update/delete racing with the walk — sink CreateEntry
is idempotent across the builtin sinks so replay is safe.

Honors existing -filerExcludePaths / -filerExcludeFileNames /
-filerExcludePathPatterns filters and skips /topics/.system/log the
same way the subscription path does.

Also log "starting from <t> (no prior checkpoint)" instead of a
misleading "resuming from 1970-01-01" when the KV has no stored offset.

* fix(filer.backup): guard initialSnapshot counters under TraverseBfs workers

TraverseBfs fans the callback out across 5 worker goroutines, so the
entryCount / byteCount updates and the 5-second progress-log gate in
runInitialSnapshot were racing. Switch the counters to atomic.Int64 and
protect the lastLog check/update with a short-scoped mutex so the heavy
sink.CreateEntry call stays outside the critical section.

Flagged by gemini-code-assist on #9126; verified with go test -race.

* fix(filer.backup): harden initialSnapshot against transient errors and path edge cases

Three review items from CodeRabbit on #9126:

1. getOffset errors no longer leave isFreshSync=true. Before, a transient
   KV read failure would cause runFilerBackup's retry loop to redo the
   full -initialSnapshot walk on every retry. Treat any offset-read
   error as "not fresh" so the snapshot only runs when we've verified
   there really is no prior checkpoint.

2. initialSnapshotTargetKey now normalizes sourcePath to a trailing-
   slash base before stripping the prefix, so edge cases where
   sourceKey equals sourcePath (trailing-slash mismatch or root-entry
   emission) no longer index past the end. Unit tests cover both
   forms.

3. Documented the TraverseBfs-enumerates-excluded-subtrees performance
   characteristic on runInitialSnapshot, since pruning requires a
   separate change to TraverseBfs itself.

* fix(filer.backup): retry setOffset after initialSnapshot to avoid full re-walks

If the snapshot walk finishes but the subsequent setOffset fails, the
retry loop in runFilerBackup will re-enter doFilerBackup with an empty
checkpoint and run the full BFS again — on a multi-million-entry tree
that's hours of wasted work over a 100-byte KV write. Retry the write a
handful of times with exponential backoff before giving up, and log
loudly at the final failure (with snapshotTsNs + sinkId) so operators
recognize the symptom instead of guessing at mysterious repeated walks.

Nitpick raised by CodeRabbit on #9126.

* fix(filer.backup): initialSnapshot ignore404, skew margin, exclude dir-entry itself

Three review items from CodeRabbit on #9126:

1. ignore404Error now threads into runInitialSnapshot. If a file is listed
   by TraverseBfs and then deleted before CreateEntry reads its chunks,
   the follow path already ignores 404s — the snapshot path was aborting
   and triggering a full re-walk. Treat an ignorable 404 as "skip this
   entry, continue."

2. snapshotTsNs now uses `time.Now() - 1min` instead of `time.Now()`.
   Metadata events are stamped server-side, so a fast backup-host clock
   could skip events that fire during or right after the walk. Matches
   the 1-minute margin meta_aggregator.go applies on initial peer
   traversal; duplicate replay is harmless because CreateEntry is
   idempotent.

3. Exclude checks now run against the entry's own full path, not just
   its parent. A walked directory whose full path matches SystemLogDir
   or -filerExcludePaths was being seeded to the destination; only its
   descendants were being skipped. Verified with a manual repro where
   -filerExcludePaths=/data/skipdir now keeps the skipdir entry itself
   off the destination.

* refactor(filer): share destKey helper between buildKey and initialSnapshot

Extract destKey(dataSink, targetPath, sourcePath, sourceKey, mTime) from
buildKey in filer_sync.go. Both the event-log path (buildKey) and the
initialSnapshot walk (initialSnapshotTargetKey) now go through the same
helper, so a walk-seeded file and an event-replayed file always resolve
to the same destination key.

As a bonus, buildKey picks up the defensive trailing-slash normalization
that initialSnapshotTargetKey introduced — no more index-past-end risk
when sourceKey happens to equal sourcePath. Also tightens the mTime
lookup to guard against nil Attributes (caught by an existing test
against buildKey when I first moved the lookup out of the incremental
branch).
2026-04-17 21:21:32 -07:00
Chris Lu
d57fc67022 fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127)
* fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files

Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was
true, printing "Change volume id for large file is not implemented yet" and
continuing. Because the BFS traversal only looks at top-level
entry.Chunks, sub-chunks referenced inside a manifest were never
considered either. For any file stored as a chunk manifest (large files
go this path), chunks in the source volume stayed put, leaving behind a
few MB of live data that vacuum and volume.deleteEmpty couldn't clean
up.

This change resolves each manifest chunk recursively, moves any
sub-chunk whose volume id is in the merge plan via the existing
moveChunk path, and re-serializes the manifest. If the manifest chunk
itself lives in a source volume, or any sub-chunk moved, the new
manifest blob is uploaded to a freshly assigned file id (the old
needle becomes orphaned and is reclaimed by vacuum like any other
moved chunk).

Fixes #9116.

* address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes

- Call UpdateEntry once per entry after the chunk loop instead of once per
  moved chunk (gemini nit).
- In dry-run mode, mark anySubChanged when a sub-chunk in the plan is
  encountered and return changed=true after printing "rewrite manifest",
  so nested manifests also surface their would-rewrites (gemini nit).
- Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is
  restored even when proto.Marshal fails (coderabbit nit).
- Reject AssignVolume results whose file id lands on a volume that is a
  source in the merge plan, and retry — otherwise the replacement
  manifest could be written to the volume being emptied (coderabbit).
2026-04-17 21:17:51 -07:00
Jaehoon Kim
96af27a131 feat(shell): add fs.distributeChunks command for even chunk distribution (#9117)
* feat(shell): add fs.distributeChunks command for even chunk distribution

Add a new weed shell command that redistributes a file's chunks evenly
across volume server nodes.

Supports three distribution modes via -mode flag:
- primary:     balance chunk ownership across nodes (default)
- replica:     balance both ownership and replica copies
- round-robin: assign chunks by offset order for sequential read
               optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...)

Additional options:
- -nodes=N to target specific number of nodes
- -apply to execute (dry-run by default)

Usage:
  fs.distributeChunks -path=/buckets/file.dat
  fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply
  fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply
  fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply

* fix(shell): improve fs.distributeChunks robustness and code quality

- Propagate flag parse errors instead of swallowing them (return err)
- Handle nil chunk.Fid by falling back to legacy FileId string parsing
- Simplify node membership check using slices.Contains

* fix(shell): fix dead round-robin print loop in fs.distributeChunks

The loop was computing targetNode with sc.index%totalNodes (original
chunk index) instead of the sequential position, and discarding it via
_ = targetNode without printing anything. Replace with a correct loop
using pos%totalNodes and actually print the first 12 node assignments.

* fix(shell): compute replication/collection per-chunk in fs.distributeChunks

Previously replication and collection were derived once from chunks[0]
and reused for all moves, causing wrong volume placement for chunks
belonging to different volumes or collections. Now each chunk looks up
its own volumeInfoMap entry immediately before calling operation.Assign.

* fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks

When the master returns an Auth token in the Assign response, use it
directly for the upload instead of generating a new JWT from the local
viper signing key. Fall back to local key generation only when Auth is
empty, matching the pattern used by other upload paths.

* fix(shell): add timeout and error handling to delete requests in fs.distributeChunks

The delete loop was ignoring http.NewRequest errors and had no timeout,
risking a nil-request panic or indefinite block. Replace with
http.NewRequestWithContext and a 30s timeout, handle request creation
errors by incrementing deleteFailCount, and cancel the context
immediately after Do returns.

* feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup

Sequential chunk moves are a bottleneck for large LLM model files with
hundreds or thousands of chunks. Use ErrorWaitGroup with
DefaultMaxParallelization (10) to run download/assign/upload concurrently.
Guard movedRecords appends, chunk.Fid updates, and writer output with a
mutex. Individual chunk failures are non-fatal and logged inline; only
successfully moved chunks are included in the metadata update.

* fix(shell): try all replica URLs on download in fs.distributeChunks

Previously only the first volume server URL was attempted, causing chunk
moves to fail if that replica was unreachable. Now iterates through all
URLs returned by LookupVolumeServerUrl and stops at the first success.

* refactor(shell): apply extract method pattern to fs.distributeChunks

Do() was a single ~615-line function. Break it into focused helpers:
- lookupFileEntry: filer entry lookup
- validateChunks: chunk manifest guard
- collectVolumeTopology: master topology query + ownership mapping
- buildDistributionCounts: chunk→node mapping and owner/copy tallies
- selectActiveNodes: target node selection
- printCurrentDistribution: per-node distribution table
- planDistribution: mode-switch planning (primary/replica/round-robin)
- printRedistributionPlan: before/after plan table
- relevantNodes: active-or-occupied node filter

Do() is now ~100 lines of orchestration; each helper has a single
clear responsibility.

* test(shell): add unit tests for fs.distributeChunks algorithms

Cover all three distribution modes and supporting helpers:
- shortName, relevantNodes
- computeOwnerTarget (even/uneven split, inactive node drain)
- buildDistributionCounts (normal + nil Fid fallback)
- selectActiveNodes (all nodes / limited count)
- planOwnerMoves (imbalanced → balanced, already balanced)
- planDistribution primary (chunks balanced, no-op when even)
- planDistribution round-robin (offset ordering, correct assignment)
- planDistribution replica (owner + copy balancing)
- printRedistributionPlan (output format)

* fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks

Download requests had no per-request timeout, unlike delete operations
which already use 30s. Replace readUrl() calls with inline
http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume
server cannot block a goroutine indefinitely during redistribution.

* fix(shell): remove redundant deleteOldChunks in fs.distributeChunks

filer.UpdateEntry already calls deleteChunksIfNotNew internally, which
computes the diff between old and new entry chunks and deletes the ones
no longer referenced. Our explicit deleteOldChunks was racing with this
filer-side cleanup, causing spurious 404 warnings on ~75% of deletes.

Remove deleteOldChunks, movedChunkRecord type, and reduce
executeChunkMoves return type to (int, error) for the moved count.

* fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks

chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with
a FileId string instead of a Fid struct, causing them to be skipped in
the replica balancing loop and looked up incorrectly in volumeInfoMap.

Introduce chunkVolumeId() that uses Fid when present and falls back to
parsing the legacy FileId string, matching the logic in
buildDistributionCounts. Apply it in the replica-mode copies loop and
in executeChunkMoves' replication/collection lookup.

* fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks

chunkVolumeId(chunk) was being called to look up replication/collection
after oldFid had already been parsed and validated. Use oldFid.VolumeId
directly to avoid redundant parsing and guarantee the correct volume ID
regardless of whether chunk.Fid is nil.

* fix(shell): improve correctness and robustness in fs.distributeChunks

- Buffer download body before upload so dlCtx timeout only covers the
  GET request; upload runs with context.Background() via bytes.NewReader
- Replace 'before, after := strings.Cut(...)' + '_ = before' with '_'
  as the first return value directly
- Clone copiesCount before replica planner mutates it, keeping the
  caller's map immutable
- Add nil-entry guard after filer LookupEntry to prevent panic on
  unexpected nil response

* feat(shell): support chunk manifests in fs.distributeChunks

Large files stored as chunk manifests were previously rejected. Resolve
manifests up front via filer.ResolveChunkManifest, redistribute the
underlying data chunks, then re-pack through filer.MaybeManifestize
before UpdateEntry. The filer's MinusChunks resolves manifests on both
sides of the diff, so old manifest and inner data chunks are GC'd
automatically.

* fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature

Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-17 21:09:36 -07:00
Lisandro Pin
6bcacedda9 Export master_disconnections metrics on volume servers. (#9104)
This allows to track connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-17 15:15:26 -07:00
Chris Lu
0315da9022 fix(s3api): self-heal stale .versions latest-version pointer on read (#9125)
* fix(s3api): self-heal stale .versions latest-version pointer on read

When the `.versions` directory metadata points at a version file that has
gone missing (e.g. a crash between deleting the latest version and
rewriting the pointer, or a concurrent delete racing with a read),
`getLatestObjectVersion` bailed with a hard error that required manual
repair. Tagging, ACL, retention, copy-source, and HEAD/GET all surfaced
NoSuchKey even though other versions remained on disk.

On `ErrNotFound`/`codes.NotFound` from the pointed-at version lookup,
rescan the `.versions` directory, pick the newest remaining non-delete-
marker entry, persist the repaired pointer (best-effort), and return
that entry. If only delete markers (or nothing) remain, the caller still
sees an error and the object correctly appears absent.

Extracted the selection logic into a pure `selectLatestContentVersion`
helper so `updateLatestVersionAfterDeletion` and the new self-heal path
share a single implementation. A warning is logged whenever the heal
kicks in so stale-pointer incidents remain visible in operator logs.

* fix(s3api): self-heal must promote newest version even if delete marker

Review feedback (gemini-code-assist): the self-heal path used
`selectLatestContentVersion`, which skips delete markers. That had two
bugs:

1. If the chronologically newest entry was a delete marker, an older
   content version would be promoted, effectively "undeleting" an
   object that was actually deleted.
2. If only delete markers remained, heal returned an error and the
   caller surfaced a hard 500 instead of the correct 404-with-
   x-amz-delete-marker response.

Add `selectLatestVersion` that picks the newest entry regardless of
type (content or delete marker) and use it in
`healStaleLatestVersionPointer`. The promoted entry flows back through
`doGetLatestObjectVersion` unchanged; downstream handlers already
detect `ExtDeleteMarkerKey` on the returned entry and render
NoSuchKey + `x-amz-delete-marker: true` (see
`s3api_object_handlers.go:722-728`).

Kept `selectLatestContentVersion` in place for
`updateLatestVersionAfterDeletion`, which deliberately limits the
pointer to a live content version in the post-deletion flow — changing
that is out of scope for this fix.

Added four tests for the new selector:
- promotes newest delete marker over older content (the reviewer case)
- picks content when content is newest
- promotes newest delete marker when only delete markers remain
- returns nil latestEntry on empty/untagged input

* fix(s3api): paginate .versions scan in self-heal path

Review feedback (gemini-code-assist, coderabbitai): the self-heal rescan
did a single-shot list(..., 1000). For objects whose version ids use the
old raw-timestamp format, filer ordering is lexicographic-ascending =
oldest-first. If the .versions directory held more than one page of
entries, the first page contained only the oldest, and the heal would
promote an older version as the new "latest" — silently surfacing stale
data on subsequent reads.

Paginate through the whole directory with `filer.PaginationSize`,
running `selectLatestVersion` per page and keeping a single running
best candidate across pages (via `compareVersionIds`). This mirrors the
pagination pattern already used by `getObjectVersionList` in the same
file and closes the window for old-format stacks larger than one page.

New-format (inverted-timestamp) ids were not affected because their
lexicographic order matches newest-first, but paginating is still the
right fix.

Also updated the function doc to reflect that self-heal now promotes
the newest entry regardless of type (content version or delete marker).

* fix(s3api): don't resurrect deleted objects; wrap ErrNotFound sentinel

Two review findings addressed:

1. `updateLatestVersionAfterDeletion` was using `selectLatestContentVersion`
   which skipped delete markers. Scenario: PUT v1, DELETE (dm1 written,
   pointer->dm1), PUT v3 (pointer->v3), user explicitly deletes v3 by
   versionId. Remaining files: v1, dm1. S3 semantics say the current
   version is the chronologically newest = dm1 (object appears deleted).
   Old code would promote v1, "undeleting" the object silently.

   Switched to `selectLatestVersion`, which picks the newest entry
   regardless of type. Also paginated the scan the same way the self-heal
   path does — otherwise a single-page `list(1000)` still mis-selects the
   latest for old-format version id stacks that exceed one page.

   Removed the `hasDeleteMarkers || !isLast` branch: with `selectLatestVersion`
   any delete marker participates in the comparison and shows up as the
   latest when it is newest, so the "keep the directory on ambiguity"
   guard becomes unreachable. Full pagination also makes the `!isLast`
   guard unnecessary — we either see every entry or surface a list error.

2. `healStaleLatestVersionPointer`'s "no remaining version" error was a
   plain `fmt.Errorf`, so callers could not distinguish genuine
   object-absence from a scan failure via `errors.Is`. Wrap it with
   `filer_pb.ErrNotFound` so the sentinel flows through the chain (the
   outer wrap already uses `%w`).

No test additions needed — `TestSelectLatestVersion_PromotesNewestDeleteMarker`
already asserts the "newer delete marker beats older content" invariant
that drives both fixes.

* refactor(s3api): drop unused selectLatestContentVersion after review

Review feedback flagged a stale comment claiming
selectLatestContentVersion "mirrors" the post-deletion semantics of
updateLatestVersionAfterDeletion. That claim became false when the
post-deletion path switched to selectLatestVersion in the previous
commit. Verified no production callers remain — only the helper's own
tests referenced it — so the cleaner fix is to delete the dead code
rather than rewrite the comment to explain "this exists but isn't
used."

- Removed selectLatestContentVersion.
- Removed the four TestSelectLatestContentVersion_* tests.
- Preserved a renamed TestSelectLatestVersion_MixedFormats so the
  mixed-format comparator coverage still runs against the active
  selector.
2026-04-17 14:57:59 -07:00
Chris Lu
bfea14320a fix(s3): reject unknown POST policy conditions and extra x-amz form fields (#9124)
* fix(s3): reject unknown POST policy conditions and extra x-amz form fields

CheckPostPolicy previously accepted policy conditions with unknown $keys
(e.g. "$foo") as satisfied, and only rejected stray X-Amz-Meta-* form
fields. Reject unknown condition keys outright, and extend the extra-
input-fields check to all X-Amz-* form fields except the reserved
auth/signing headers. Matches AWS S3 POST Object behavior.

* refactor(s3): drop redundant $x-amz-meta- prefix check in CheckPostPolicy

The $x-amz- prefix already subsumes $x-amz-meta-, so the explicit
$x-amz-meta- check adds no coverage. Simplify the else-if condition.
Addresses gemini-code-assist review on PR #9124.

* style(s3): align unknown-key policy error with [op, key, value] trailer

Reformat the unknown-condition-key error in CheckPostPolicy to include
the same "[op, key, value]" trailer used by the other condition-failed
messages. The value slot is empty because no comparison occurs for an
unknown key. The descriptive "unknown condition key" suffix is kept so
operators can still tell this failure from a mismatched value.

* fix(s3): honor starts-with prefix-stem POST policies when checking extras

AWS POST policies use ["starts-with","$x-amz-meta-",""] to allow any
X-Amz-Meta-* form field. The previous exact-match policyXAmzKeys would
flag every X-Amz-Meta-Foo as an "Extra input fields" failure because
only the stem X-Amz-Meta- was stored. Track starts-with conditions whose
key ends in "-" with an empty value as prefix stems, and accept any
X-Amz-* form field matching one of those stems.

* fix(s3): validate value prefix for starts-with POST policy stems

Drop the policy.Value == "" gate when detecting prefix-stem conditions
so that ["starts-with","$x-amz-meta-","pfx-"] is recognized as a prefix
rule. Track the required value prefix alongside the name prefix, enforce
it against every matching form field in the extras loop, and skip the
prefix-stem condition in the main iteration (it has no single form
field to evaluate). Also include policy.Value in the unknown-condition
error trailer for clearer debugging. Addresses gemini-code-assist
review on PR #9124.

* fix(s3): check every matching POST policy rule, not just the first

The extras loop exited early on exact-key match and broke on the first
matching prefix stem. Per AWS, a form field must satisfy every policy
condition that applies to it, so an exact-match field must still honor
any overlapping starts-with stem's value prefix, and multiple stems on
the same field must all hold. Drop both early exits: start matched from
the exact-key lookup, iterate all prefix stems, and fail on the first
value-prefix violation. Addresses gemini-code-assist review on PR
#9124.
2026-04-17 14:55:12 -07:00
Chris Lu
0bddc2652e fix(s3): propagate validated POST form fields to upload headers (#9123)
* fix(s3): propagate validated POST form fields to upload headers

POST Object form fields like acl, Content-Encoding, x-amz-storage-class,
x-amz-tagging, and x-amz-server-side-encryption were validated against
the POST policy but never forwarded to the underlying PUT, so the
validated values had no effect. Forward all non-reserved x-amz-* fields
plus acl (as x-amz-acl) and Content-Encoding. Reserved POST policy
mechanism fields (Policy, Signature, Key, etc.) are still excluded.

* fix(s3): also forward Content-Language from POST form to upload headers

AWS S3 POST Object supports Content-Language; add it to the set of
content headers forwarded by applyPostPolicyFormHeaders alongside
Cache-Control, Expires, Content-Disposition, and Content-Encoding.
Addresses gemini-code-assist review on PR #9123.

* refactor(s3): only look up form value in branches that use it

applyPostPolicyFormHeaders previously called formValues.Get(k) for every
form field, including fields that fall through the switch (non-reserved
fields that are neither Acl, a forwarded content header, nor X-Amz-*).
Move the lookup inside the switch cases that actually use it.
2026-04-17 14:55:06 -07:00
Chris Lu
2da24cc230 fix(s3): return 403 on POST policy violation instead of 307 redirect (#9122)
* fix(s3): return 403 on POST policy violation instead of 307 redirect

CheckPostPolicy failures previously responded with HTTP 307 Temporary
Redirect to the request URL, which causes clients to re-POST and
obscures the failure. Return 403 AccessDenied so the client surfaces
the error.

* test(s3): exercise PostPolicyBucketHandler end-to-end for 403 mapping

Replace the shallow ErrAccessDenied tautology test with one that builds
a signed POST multipart request whose policy conditions cannot be
satisfied, calls PostPolicyBucketHandler directly, and asserts HTTP 403
with no Location redirect header. Addresses gemini-code-assist review on
PR #9122.

* fix(s3): surface POST policy failure reason in AccessDenied response

Add s3err.WriteErrorResponseWithMessage so a caller can keep the
standard error code mapping while providing a specific Message. Use it
from PostPolicyBucketHandler so the XML body carries the CheckPostPolicy
error (e.g. which condition failed or that the policy expired) rather
than the generic "Access Denied." description. Addresses gemini-code-
assist review on PR #9122.

* refactor(s3err): delegate WriteErrorResponse to WriteErrorResponseWithMessage

The two helpers shared every line except the Message override. Fold
WriteErrorResponse into a one-line delegation that passes an empty
message, so the request-id/mux/apiError logic lives in exactly one
place. Addresses gemini-code-assist review on PR #9122.
2026-04-17 14:54:58 -07:00
Chris Lu
32cbed9658 fix(fuse-tests): avoid ephemeral port reuse racing weed mini bind
freePort allocated in [20000, 55535], which overlaps the Linux ephemeral
range (32768-60999). The kernel could reuse the chosen port for an
outbound connection between the test closing its listener and weed mini
re-checking availability, causing:

  Port allocation failed: port N for Filer (specified by flag
  filer.port) is not available on 0.0.0.0 and cannot be used

Narrow the range to [20000, 32000] to stay below the ephemeral floor,
and pass -ip.bind=127.0.0.1 so mini's pre-check runs on the same IP the
test actually reserved the port on.
2026-04-17 13:47:17 -07:00
Chris Lu
cce98fcecf fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly (#9120)
* fix(s3): strip client-supplied X-SeaweedFS-Principal/Session-Token in AuthSignatureOnly

AuthSignatureOnly is the only auth gate in front of S3Tables routes
(incl. CreateTableBucket) and UnifiedPostHandler, but unlike
authenticateRequestInternal it did not clear the internal IAM
trust headers before running signature verification. S3Tables
authorizeIAMAction reads X-SeaweedFS-Principal directly from the
request and prefers it over the authenticated identity's PrincipalArn,
so a signed low-privilege caller could append that header after signing
(unsigned header, SigV4 still verifies) and have IAM policy evaluated
against a spoofed principal, bypassing authorization.

Clear both X-SeaweedFS-Principal and X-SeaweedFS-Session-Token at the
top of AuthSignatureOnly, mirroring the existing guard in
authenticateRequestInternal. Add a regression test covering the
header-injection path.

* refactor(s3): route AuthSignatureOnly through authenticateRequestInternal

Addresses review feedback: both entry points were independently
maintaining the internal-IAM-header stripping and the auth-type dispatch
switch. Collapse AuthSignatureOnly into a thin wrapper around
authenticateRequestInternal so the security-critical header scrub and
the signature-verify switch live in one place. Post-auth behavior
unique to AuthSignatureOnly (AmzAccountId header) stays inline.

No functional change beyond two harmless telemetry tweaks that now
match authenticateRequestInternal: the per-branch glog verbosity shifts
from V(3) to V(4), and the anonymous-found path now sets AmzAuthType.

* refactor(s3): centralize X-SeaweedFS-Principal/Session-Token header names

Introduce SeaweedFSPrincipalHeader and SeaweedFSSessionTokenHeader in
weed/s3api/s3_constants so the trust-header literals are defined once and
referenced consistently by the auth scrub, JWT auth path, bucket policy
principal resolution, IAM authorization, and S3Tables IAM evaluation.
Replace every remaining usage in weed/s3api and weed/s3api/s3tables.
This removes the drift risk the reviewer called out: adding another call
site with a typo can no longer silently bypass the scrub.

Pure rename, no behavior change. No-op integration-test helper in
test/s3/iam/s3_iam_framework.go left untouched (separate module, and the
server now strips the client-supplied value regardless).
2026-04-17 12:23:21 -07:00
Chris Lu
88ac2d0431 security(s3api): reject unsigned x-amz-* headers in SigV4 requests (#9121)
A presigned URL holder could attach arbitrary x-amz-* headers to a PUT
request (e.g. x-amz-tagging, x-amz-acl, x-amz-storage-class,
x-amz-server-side-encryption*, x-amz-object-lock-*, x-amz-meta-*,
x-amz-website-redirect-location, x-amz-grant-*). Because only the
headers declared in SignedHeaders participate in signature
verification, the added headers bypass authentication; the PUT handler
then persists them into the object's Extended metadata.

Match the AWS SigV4 rule: every x-amz-* header present in the request
must appear in SignedHeaders. Exempt x-amz-content-sha256 (already
tamper-protected via the canonical request's payload-hash line) and,
for presigned URLs, the SigV4 protocol parameters that live in the
query string (X-Amz-Algorithm/Credential/Date/Expires/SignedHeaders/
Signature) in case they are duplicated as headers.

Applies to both header-based and presigned SigV4; non-amz headers are
unaffected.
2026-04-17 12:20:28 -07:00
Chris Lu
e8767f42b6 Add security policy for vulnerability reporting 2026-04-17 09:51:21 -07:00
Chris Lu
f720f559cb ci(kafka-loadtest): switch off Ubuntu/Debian base images to avoid apt mirror flakes (#9119)
* ci(kafka-loadtest): retry apt-get to survive Ubuntu mirror flakes

The Kafka Quick Test workflow's Docker build of Dockerfile.loadtest
keeps hitting "Connection failed [IP: ...]" on archive.ubuntu.com /
security.ubuntu.com mid-build, e.g.:

    failed to solve: process "/bin/sh -c apt-get update && \
      apt-get install -y ca-certificates curl jq bash netcat ..."
      did not complete successfully: exit code: 100

Same class of failure PR #9106 fixed for pjdfstest. Apply the same
two retry knobs to Dockerfile.loadtest and Dockerfile.seektest so a
transient mirror flake retries five times with a 30s timeout instead
of failing the whole workflow.

(The fuller pjdfstest fix also restructured the build to use
docker/build-push-action with type=gha cache. Doing that for
kafka-client-loadtest would mean rewriting the make/docker-compose
build path; defer until the apt-retry alone proves insufficient.)

* ci(kafka-loadtest): also drop recommends/suggests + apt-get clean

Address PR review (gemini-code-assist): fully align with PR #9106's
pattern by adding --no-install-recommends / --no-install-suggests so
the runtime images stay small and don't pull in extra packages, plus
apt-get clean before rm -rf /var/lib/apt/lists/* in Dockerfile.seektest.

* ci(kafka-loadtest): use alpine / maven base images instead of apt

The previous rounds of apt-retry / apt-clean knobs aren't enough:
the Ubuntu mirror is persistently unreachable from the GitHub runner
for minutes at a time, which blows past the 5-retry / 30-second
Acquire configuration and still kills the build (see run 24551809614).

Switch both runtime images so no apt fetch is needed at all:

- Dockerfile.loadtest now runs on alpine:3.20. All runtime deps
  (ca-certificates, curl, jq, bash, netcat-openbsd) are in the
  Alpine main repo, fetched from Alpine's CDN rather than the
  Ubuntu archive that keeps going dark.
- Dockerfile.seektest now uses maven:3.9-eclipse-temurin-11, which
  ships JDK 11 and Maven preinstalled — no apt-get maven step.

This also means the runtime images no longer care about
Acquire::Retries / DEBIAN_FRONTEND / apt-get clean, so those lines
are removed with the apt call they were configuring.
2026-04-17 07:43:19 -07:00
Chris Lu
45578a42e9 fix(volume): keep vacuum running past dangling .idx entries (#9115)
* fix(volume): keep vacuum running past dangling .idx entries

Vacuum compaction aborted entirely on the first .idx entry whose offset
pointed past the end of the .dat file, surfacing as `cannot hydrate
needle from file: EOF` and stalling progress on every other volume.

In both Go and Rust:

- During compaction, skip an unreadable needle and continue. The bytes
  it pointed at were already unreachable via reads, so dropping the
  index reference makes the post-vacuum volume consistent. Real EIO
  still bails out so a disk fault is not silently papered over.

- At volume load, do a single linear scan of the .idx and confirm
  every (offset + actual size) fits inside .dat. The pre-existing
  integrity check only looked at the last 10 entries, so deeper
  corruption (e.g. left over from a crashed batched write) went
  undetected and only surfaced later as a vacuum EOF. A failure now
  marks the volume read-only at load time so an operator can react.

Refs #8928

* fix(volume): only skip permanent-corruption needle reads during vacuum

Address PR review feedback (gemini-code-assist + coderabbit):

The original patch skipped any non-EIO read failure, which would silently
drop needles on transient errors — Windows hardware bad-sector errors
(ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network
timeouts and EROFS would also slip through and shrink the volume.

Switch to an explicit whitelist of permanent-corruption shapes:

- Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of
  range" errors with %w so callers can match via errors.Is.
- copyDataBasedOnIndexFile now skips only when the read failure is
  io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid,
  or ErrorCorrupted. Anything else (real disk faults, environmental
  errors, Windows hardware codes) aborts the compaction so an
  operator notices.
- Mirror the same whitelist in the Rust volume server, matching on
  io::ErrorKind::UnexpectedEof and the NeedleError corruption variants
  (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort).

Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows
t.TempDir() cleanup can release the .dat/.idx handles.

Refs #8928

* fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch

Address PR review: the fallback branch in ReadBytes returned an
unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller
using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap
with %w so the whitelist applies, while leaving the existing direct
sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize
retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry
still triggers.

Refs #8928

* fix(volume): integrate dangling-idx check into existing index load walk

Address PR review (gemini-code-assist, medium): the structural .idx
check used to do a second linear scan of the index file at every volume
load, doubling the disk-I/O cost on servers managing many volumes.

Track the largest (offset + actual size) seen during the existing
needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`,
`NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`,
`DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`,
exposed as `MaxNeedleEnd()` on the NeedleMapper interface.
`volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size
after the load is complete — pure numeric comparison, no extra I/O.

The standalone `verifyIndexFitsInDat` helper and its caller in
`CheckVolumeDataIntegrity` are removed; the test that used to drive
the helper directly now exercises the new path via
`LoadCompactNeedleMap`.

Mirror the same change in the Rust volume server: track
`max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()`
on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum.
The Rust load walk already happens in `load_from_idx` for both map
kinds, so the structural check becomes free.

Refs #8928
2026-04-16 22:01:34 -07:00
Chris Lu
664ae64646 ci(binaries_dev): serialize concurrent runs to prevent asset name collisions
Multiple master pushes within the same minute produced identical BUILD_TIME
values, causing concurrent workflow runs to race on identically-named release
assets. Upload retries hit 422 already_exists and failed the build. Adding a
concurrency group with cancel-in-progress ensures only the latest dev build
runs at a time, which is fine since only the latest dev artifacts matter.
2026-04-16 18:09:57 -07:00
dependabot[bot]
018e648d00 build(deps): bump github.com/jackc/pgx/v5 from 5.8.0 to 5.9.0 (#9113)
Bumps [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) from 5.8.0 to 5.9.0.
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jackc/pgx/compare/v5.8.0...v5.9.0)

---
updated-dependencies:
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-16 16:28:18 -07:00
Lars Lehtonen
66ce0c29cf fix(weed/query/engine): check for nil pointers (#9114) 2026-04-16 16:27:55 -07:00
Chris Lu
108e6b0d5f Delete scheduled_tasks.lock 2026-04-16 16:16:25 -07:00
Chris Lu
9d15705c16 fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112)
* fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C

Interrupts fired grace hooks in registration order, so master (started
first) shut down before its clients, producing heartbeat-canceled errors
and masterClient reconnection noise during weed mini shutdown. Admin/s3/
webdav had no interrupt hooks at all and were killed at os.Exit.

- grace: execute interrupt hooks in LIFO (defer-style) order so later-
  started services tear down first.
- filer: consolidate the three separate interrupt hooks (gRPC / HTTP /
  DB) into one that runs in order, so filer shutdown stays correct
  independent of FIFO/LIFO semantics.
- mini: add MiniClientsShutdownCtx (separate from test-facing
  MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3,
  WebDAV and the maintenance worker observe it; runMini registers a
  cancel hook after startup so under LIFO it fires first and waits up to
  10s on a WaitGroup for those services to drain before filer, volume,
  and master shut down.

Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP
-> DB) -> volume -> master.

* refactor(mini): group mini-client shutdown into one state struct

The first pass spread the shutdown plumbing across three globals
(MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two
ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker).

Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini
invocation, and chain its ctx from MiniClusterCtx so clients only observe
one signal. Tests that cancel MiniClusterCtx still trigger client
shutdown via parent-child propagation.

- resetMiniClients() installs fresh state at the top of runMini, so
  in-process test reruns don't inherit stale ctx/wg.
- onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown
  and only observes one ctx.
- trackMiniClient() replaces the manual wg.Add/Done dance for the admin
  goroutine.
- miniClientsCtx() gives the admin startup a ctx without re-deriving.
- triggerMiniClientsShutdown(timeout) is the interrupt hook body.

No behaviour change; existing tests pass.

* refactor: generalize shutdown ctx as an option, not a mini-specific helper

Several service files (s3, webdav, filer, master, volume) observed the
mini-specific MiniClusterCtx or called onMiniClientsShutdown directly.
That leaked mini orchestration into code that also runs under weed s3,
weed webdav, weed filer, weed master, and weed volume standalone.

Replace with a generic `shutdownCtx context.Context` field on each
service's Options struct. When non-nil, the server watches it and shuts
down gracefully; when nil (standalone), the shutdown path is a no-op.

Mini wires the contexts up from a single place (runMini):
 - miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx =
   MiniClusterCtx (drives test-triggered teardown)
 - miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives
   Ctrl+C teardown before filer/volume/master)

All knowledge of MiniClusterCtx now lives in mini.go.

* fix(mini): stop worker before clients ctx so admin shutdown isn't blocked

Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/
webdav hook sat for 10s then logged "timed out". Admin had started its
shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop,
waiting for the still-connected worker stream. That in turn left filer
clients connected and cascaded into filer's own 10s gRPC graceful-stop
timeout.

Two causes, both fixed:

1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which
   makes managerLoop `break out` and exit), then called getTaskLoad()
   which sends to the same unbuffered cmd channel — no receiver, hangs
   forever. Reorder Stop() to snapshot the admin client and drain tasks
   BEFORE sending ActionStop, and call Disconnect() via the local
   snapshot afterwards.

2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads
   from c.incoming, which Disconnect closes, yielding a nil response and
   a panic on response.Message. Handle the closed channel explicitly.

3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs
   synchronously BEFORE the clients ctx is cancelled. Register worker
   shutdown there so admin's worker-gRPC GracefulStop finds the worker
   already disconnected and returns immediately, instead of waiting on
   a stream that is about to close anyway.

Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full
process exit in ~11s (the remaining 10s is a pre-existing filer gRPC
graceful-stop timeout, not cascaded from the clients tier).

* feat(mini): cap filer gRPC graceful stop at 1s under weed mini

Full weed mini shutdown was ~11s on a clean exit, dominated by the
filer's default 10s gRPC GracefulStop timeout while background
SubscribeLocalMetadata streams drained.

Expose the timeout as a FilerOptions.gracefulStopTimeout field (default
10s for standalone weed filer) and set it to 1s in mini. Clean weed mini
shutdown now takes ~2s.
2026-04-16 16:11:01 -07:00
Chris Lu
9554e259dd fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint (#9109)
* fix(iceberg): route catalog clients to the right bucket and vend S3 endpoint

DuckDB ATTACH 's3://<bucket>/' AS cat (TYPE 'ICEBERG', ...) was failing
with "schema does not exist" because GET /v1/config ignored the warehouse
query parameter and returned no overrides.prefix, so subsequent requests
fell through to the hard-coded "warehouse" default bucket instead of the
one the client attached. LoadTable also returned an empty config, forcing
clients to discover the S3 endpoint out-of-band and producing 403s on
direct iceberg_scan calls.

- handleConfig now echoes overrides.prefix = bucket and defaults.warehouse
  when ?warehouse=s3://<bucket>/ is supplied.
- getBucketFromPrefix honors a warehouse query parameter as a fallback for
  clients that skip the /v1/config handshake.
- LoadTable responses advertise s3.endpoint and s3.path-style-access so
  clients can reach data files without separate configuration.

Refs #9103

* address review feedback on iceberg S3 endpoint vending

- deriveS3AdvertisedEndpoint is now a method on S3Options; honors
  externalUrl / S3_EXTERNAL_URL, switches to https when -s3.key.file is
  set, uses the https port when configured, and brackets IPv6 literals
  via util.JoinHostPort.
- handleCreateTable returns s.buildFileIOConfig() in both its staged and
  final LoadTableResult branches so create and load flows see the same
  FileIO hints.
- Add unit test coverage for the endpoint derivation scenarios.

* address CI and review feedback for #9109

- DuckDB integration test now runs under its own newOAuthTestEnv (with a
  valid IAM config) so the OAuth2 client_credentials flow DuckDB requires
  actually works; the shared env has no registered credentials, which was
  the cause of the CI failure. Helper createTableWithToken was added to
  create tables via Bearer auth.
- Tighten TestIssue9103_LoadTableDoesNotVendS3FileIOCredentials to also
  assert s3.path-style-access = "true", so a partial regression where the
  endpoint is vended but path-style is dropped still fails.
- deriveS3AdvertisedEndpoint now logs a startup hint when it infers the
  host from os.Hostname because the bind IP is a wildcard, pointing
  operators at -s3.externalUrl / S3_EXTERNAL_URL for reverse-proxy
  deployments where the inferred name is not externally reachable.
- handleConfig has a comment explaining that any sub-path in the
  warehouse URL is dropped because catalog routing is bucket-scoped.

* fix(iceberg): make advertised S3 endpoint strictly opt-in; add region

The wildcard-bind fallback to os.Hostname() in deriveS3AdvertisedEndpoint
was hijacking correctly-configured clients: on the CI runner it produced
http://runnervmrc6n4:<port>, which Spark (running in Docker) could not
resolve, so Spark iceberg tests began failing after the endpoint started
being vended in LoadTable responses.

Change the rule so advertising is opt-in and never guesses a host that
might not be routable:
  - -s3.externalUrl / S3_EXTERNAL_URL wins (covers reverse-proxy).
  - Otherwise, only an explicit, non-wildcard -s3.bindIp is used.
  - Wildcard / empty bind returns "" so no FileIO endpoint is vended and
    existing clients keep using their own configuration.

buildFileIOConfig additionally vends s3.region (defaulting to the same
value baked into table bucket ARNs) whenever it vends an endpoint, so
DuckDB's attach does not fail with "No region was provided via the
vended credentials" when the operator has opted in.

The DuckDB issue-9103 integration test runs under an env with a
wildcard bind, so it explicitly sets AWS_REGION in the docker run to
pick up the same default. The HTTP-level LoadTable-vending test was
dropped because its expectation is now conditional and already covered
by unit tests in iceberg_issue_9103_test.go.
2026-04-16 15:51:43 -07:00
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
7916e61c08 fix(mount): avoid self-notify deadlock in Link and CopyFileRange handlers (#9110)
The Link and CopyFileRange FUSE request handlers were calling
fuseServer.InodeNotify (and EntryNotify for copy) synchronously while
the kernel was still waiting for the request's reply on the same
/dev/fuse fd. Notifications share that fd, so the syscall.Write can
block indefinitely when the kernel hasn't drained its queue yet,
hanging the entire mount. A goroutine dump from a stuck mount showed
the Link handler blocked in syscall.Write inside InodeNotify while the
server's read loop kept waiting for new requests.

Drop the synchronous notifies. The local meta cache is still updated
inline, so subsequent filesystem ops see the fresh state; the kernel's
attr/dentry caches re-fetch once their TTL expires.
2026-04-16 14:09:32 -07:00
Chris Lu
ecc0390795 fix(master): eagerly remove volume from writable when assign hits limit (#9108)
* fix(master): eagerly remove volume from writable when RecordAssign hits limit

Previously, a volume was only removed from the writable list by the
heartbeat-driven CollectDeadNodeAndFullVolumes pass, which runs every
pulse (5s) after a 5s heartbeat. Under sustained concurrent writes,
fio-style workloads observed in the field grew volumes 8-20x past the
configured 100MB limit (median 530MB, peak 1.98GB) during that
5-15s detection window.

RecordAssign already tracks effective size (reported + pending) on each
/dir/assign. It now also removes the volume from writable the moment
effectiveSize reaches volumeSizeLimit, and mirrors the activeVolumeCount
decrement that Topology.SetVolumeCapacityFull would have done on the
next heartbeat. The heartbeat path remains unchanged and idempotent
(vl.SetVolumeCapacityFull returns false if already removed, so no
double-decrement).

Recovery still works: if a heartbeat later reports size < limit and
the volume is not oversized, EnsureCorrectWritables adds it back.

- weed/topology/volume_layout.go: RecordAssign returns reachedCapacity
  bool; adds AdjustActiveVolumeCountForFull helper.
- weed/topology/topology.go: PickForWrite invokes the decrement on
  eager full transitions.
- TestPickForWrite: pass a 1024-byte hint instead of 0 so the default
  1MB pendingDelta does not immediately bust the test's 32KB limit.
- New TestRecordAssignReachingCapacityRemovesFromWritable covers the
  eager removal, active count accounting, and no-double-accounting.

* fix(master): recover eagerly-removed volume once decay clears pending

After RecordAssign eagerly removes a volume from writables because
effectiveSize reached the limit, decay can later bring effectiveSize
back under the limit (e.g., when a burst of assigns didn't all result
in uploads). Without recovery the volume would stay non-writable until
vacuum or a ReadOnly flip.

UpdateVolumeSize now re-adds the volume to writables once all of the
following hold:

  * RecordAssign is what removed it (tracked via fullSince timestamp)
  * at least capacityRecoveryDelay has elapsed since the removal (30s)
    — this prevents bouncing during a steady stream of assigns near
    the limit
  * effectiveSize has decayed below the crowded threshold (90% of limit)
  * reportedSize is under the limit (actual disk is not over)
  * standard EnsureCorrectWritables preconditions: enough copies, all
    copies writable, not oversized

The caller (SyncDataNodeRegistration) re-increments activeVolumeCount
symmetrically with the decrement done on eager removal.

* review: release VolumeLayout lock before UpAdjustDiskUsageDelta

adjustActiveVolumeCount held vl.accessLock across the tree-climbing
UpAdjustDiskUsageDelta walk. That walk takes per-level DiskUsages
locks and could be re-entered from other call paths that hold a
node-level lock and then acquire vl.accessLock. Copy the node list
under the VolumeLayout lock and release it before the tree walk to
eliminate the lock-ordering hazard.
2026-04-16 12:50:30 -07:00
Chris Lu
40ffc73aa8 ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106)
* ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes

Replace the local buildx cache + manual fallback with docker/setup-buildx-action
and docker/build-push-action using type=gha cache. The e2e and pjdfstest Dockerfile
layers now persist across runs in GitHub's own cache backend, so apt-get update
only hits Ubuntu mirrors when the Dockerfiles change. Also add Acquire::Retries
and Timeout so first-run cache-miss builds survive transient mirror sync errors.

* ci(pjdfstest): use local registry to share e2e image across buildx builds

The docker-container buildx driver cannot see images loaded into the host
Docker daemon, so the second build's FROM chrislusf/seaweedfs:e2e failed with
"not found" on registry-1.docker.io.

Run a local registry:2 on the runner, push both images to localhost:5000,
remap the FROM via build-contexts so the Dockerfile stays unchanged, then
tag the pulled images locally for docker compose to consume.
2026-04-16 12:50:19 -07:00
Chris Lu
ff4f96c71f fix(filer): drop stale master gRPC cache on stream death (#9102) (#9107)
* fix(filer): drop stale master gRPC cache on stream death (#9102)

When the master server restarts behind a stable L4 endpoint (e.g. a
Kubernetes ClusterIP Service), the filer's streaming KeepConnected
channel detects the disconnect and reconnects, but the shared
request-path ClientConn cached in pb.grpcClients can remain in READY
state while actually being dead. New AssignVolume/LookupVolume calls
reuse that cached channel and return `rpc error: code = Canceled desc
= context canceled` for every request, until the filer pod is
restarted.

- Expose pb.InvalidateGrpcConnection(address) to drop a cached
  ClientConn when a higher-level signal says it is stale.
- In MasterClient.tryConnectToMaster, invalidate the cached
  request-path channel whenever the KeepConnected stream returns, so
  unrelated callers dial fresh on their next RPC.
- Extend operation.Assign's retry predicate to cover Canceled and
  DeadlineExceeded while the caller context is still live: the first
  failure invalidates the stale ClientConn via
  shouldInvalidateConnection, and the retry dials a new channel.

* fix(grpc): invalidate cached peer conn on streaming death in other paths

Extends the master-client fix to the other streaming-caller + cached
non-streaming-peer pairs that share the same stale-channel failure mode
when the peer restarts behind a stable L4 endpoint (k8s Service VIP,
external load balancer):

- pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote gateway,
  etc. → filer): invalidate the filer's cached ClientConn when the
  SubscribeMetadata stream returns an error.
- filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer):
  invalidate the peer's cached ClientConn after doSubscribeToOneFiler
  fails, so the next iteration's readFilerStoreSignature / updateOffset
  calls dial fresh.
- mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator
  (subscriber → broker): invalidate the broker's cached ClientConn when
  the SubscribeMessage / SubscriberToSubCoordinator stream errors.
- mq broker.BrokerConnectToBalancer (broker → broker-balancer):
  invalidate the balancer's cached ClientConn after the
  PublisherToPubBalancer stream errors.

* address review feedback on InvalidateGrpcConnection

- pb.InvalidateGrpcConnection: drop the cache entry under grpcClientsLock
  but call ClientConn.Close() after releasing the lock, so Close's
  internal synchronisation/IO doesn't serialise unrelated callers on the
  global map lock.
- wdclient.tryConnectToMaster: only invalidate the cached request-path
  channel when the streaming call returned an error. On a healthy leader
  redirect (gprcErr == nil) the cached channel is still usable and
  invalidating it just causes a needless re-dial from concurrent callers.

* refactor(grpc): centralize peer-conn invalidation in streaming path

Previously every streaming caller duplicated the same invalidate-cached-
non-streaming-peer-conn wrapper around their WithGrpcClient(true, ...)
call. Move that logic into WithGrpcClient itself: when the streaming
fn returns an error, invalidate any cached ClientConn for the same
address. This removes six near-identical call-site wrappers and gives
every current and future streaming caller the fix by default.

Also aligns the non-streaming branch with the new Invalidate helper's
lock discipline: delete the cache entry under grpcClientsLock, then
Close the ClientConn after releasing the lock.
2026-04-16 12:10:25 -07:00
Chris Lu
9896eade51 feat(mount): set FOPEN_KEEP_CACHE on re-open of unchanged files (#9097)
* feat(mount): set FOPEN_KEEP_CACHE when file mtime is unchanged

On re-open of an unmodified file, signal the kernel to preserve its
existing page cache. This eliminates redundant volume server reads for
workloads that repeatedly open-read-close the same files (build systems,
config readers, etc.).

* fix(mount): use guarded type assertion for openMtimeCache load

Use the two-value form of type assertion when loading from sync.Map
to prevent potential panics if a non-int64 value is ever stored.

* fix(mount): skip redundant mtime store and invalidate on truncation

- Avoid redundant sync.Map Store when cached mtime already matches
  the current mtime, reducing contention on the hot open path.
- Invalidate openMtimeCache in SetAttr when file size changes
  (truncation), preventing stale kernel page cache after ftruncate.

* fix(mount): use nanosecond mtime precision and bounded cache for FOPEN_KEEP_CACHE

- Compare both Mtime (seconds) and MtimeNs (nanoseconds) to detect
  sub-second modifications common in automated workloads.
- Replace unbounded sync.Map with a bounded map + mutex (8192 entries,
  random eviction when full), following the existing atimeMap pattern.
- Extract applyKeepCacheFlag and invalidateOpenMtimeCache methods for
  clarity and testability.
- Add tests for nanosecond precision and cache eviction.

* fix(mount): invalidate mtime cache in truncateEntry for O_TRUNC consistency

Add invalidateOpenMtimeCache call to truncateEntry so the Create path
with O_TRUNC follows the same explicit invalidation pattern as SetAttr
and Write.
2026-04-16 11:37:52 -07:00