seaweedfs/weed/pb
Chris Lu ff4f96c71f fix(filer): drop stale master gRPC cache on stream death (#9102) (#9107)
* fix(filer): drop stale master gRPC cache on stream death (#9102)

When the master server restarts behind a stable L4 endpoint (e.g. a
Kubernetes ClusterIP Service), the filer's streaming KeepConnected
channel detects the disconnect and reconnects, but the shared
request-path ClientConn cached in pb.grpcClients can remain in READY
state while actually being dead. New AssignVolume/LookupVolume calls
reuse that cached channel and return `rpc error: code = Canceled desc
= context canceled` for every request, until the filer pod is
restarted.

- Expose pb.InvalidateGrpcConnection(address) to drop a cached
  ClientConn when a higher-level signal says it is stale.
- In MasterClient.tryConnectToMaster, invalidate the cached
  request-path channel whenever the KeepConnected stream returns, so
  unrelated callers dial fresh on their next RPC.
- Extend operation.Assign's retry predicate to cover Canceled and
  DeadlineExceeded while the caller context is still live: the first
  failure invalidates the stale ClientConn via
  shouldInvalidateConnection, and the retry dials a new channel.

* fix(grpc): invalidate cached peer conn on streaming death in other paths

Extends the master-client fix to the other streaming-caller + cached
non-streaming-peer pairs that share the same stale-channel failure mode
when the peer restarts behind a stable L4 endpoint (k8s Service VIP,
external load balancer):

- pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote gateway,
  etc. → filer): invalidate the filer's cached ClientConn when the
  SubscribeMetadata stream returns an error.
- filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer):
  invalidate the peer's cached ClientConn after doSubscribeToOneFiler
  fails, so the next iteration's readFilerStoreSignature / updateOffset
  calls dial fresh.
- mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator
  (subscriber → broker): invalidate the broker's cached ClientConn when
  the SubscribeMessage / SubscriberToSubCoordinator stream errors.
- mq broker.BrokerConnectToBalancer (broker → broker-balancer):
  invalidate the balancer's cached ClientConn after the
  PublisherToPubBalancer stream errors.

* address review feedback on InvalidateGrpcConnection

- pb.InvalidateGrpcConnection: drop the cache entry under grpcClientsLock
  but call ClientConn.Close() after releasing the lock, so Close's
  internal synchronisation/IO doesn't serialise unrelated callers on the
  global map lock.
- wdclient.tryConnectToMaster: only invalidate the cached request-path
  channel when the streaming call returned an error. On a healthy leader
  redirect (gprcErr == nil) the cached channel is still usable and
  invalidating it just causes a needless re-dial from concurrent callers.

* refactor(grpc): centralize peer-conn invalidation in streaming path

Previously, every streaming caller duplicated the same wrapper around its
WithGrpcClient(true, ...) call to invalidate the cached non-streaming
peer conn. Move that logic into WithGrpcClient itself: when the streaming
fn returns an error, invalidate any cached ClientConn for the same
address. This removes six near-identical call-site wrappers and gives
every current and future streaming caller the fix by default.

Also aligns the non-streaming branch with the new Invalidate helper's
lock discipline: delete the cache entry under grpcClientsLock, then
Close the ClientConn after releasing the lock.
2026-04-16 12:10:25 -07:00