mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-23 18:21:28 +00:00
* fix(filer): drop stale master gRPC cache on stream death (#9102)

When the master server restarts behind a stable L4 endpoint (e.g. a
Kubernetes ClusterIP Service), the filer's streaming KeepConnected
channel detects the disconnect and reconnects, but the shared
request-path ClientConn cached in pb.grpcClients can remain in READY
state while actually being dead. New AssignVolume/LookupVolume calls
reuse that cached channel and return
`rpc error: code = Canceled desc = context canceled` for every request,
until the filer pod is restarted.

- Expose pb.InvalidateGrpcConnection(address) to drop a cached
  ClientConn when a higher-level signal says it is stale.
- In MasterClient.tryConnectToMaster, invalidate the cached
  request-path channel whenever the KeepConnected stream returns, so
  unrelated callers dial fresh on their next RPC.
- Extend operation.Assign's retry predicate to cover Canceled and
  DeadlineExceeded while the caller context is still live: the first
  failure invalidates the stale ClientConn via
  shouldInvalidateConnection, and the retry dials a new channel.

* fix(grpc): invalidate cached peer conn on streaming death in other paths

Extends the master-client fix to the other streaming-caller + cached
non-streaming-peer pairs that share the same stale-channel failure mode
when the peer restarts behind a stable L4 endpoint (k8s Service VIP,
external load balancer):

- pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote
  gateway, etc. → filer): invalidate the filer's cached ClientConn when
  the SubscribeMetadata stream returns an error.
- filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer):
  invalidate the peer's cached ClientConn after doSubscribeToOneFiler
  fails, so the next iteration's readFilerStoreSignature / updateOffset
  calls dial fresh.
- mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator
  (subscriber → broker): invalidate the broker's cached ClientConn when
  the SubscribeMessage / SubscriberToSubCoordinator stream errors.
- mq broker.BrokerConnectToBalancer (broker → broker-balancer):
  invalidate the balancer's cached ClientConn after the
  PublisherToPubBalancer stream errors.

* address review feedback on InvalidateGrpcConnection

- pb.InvalidateGrpcConnection: drop the cache entry under
  grpcClientsLock but call ClientConn.Close() after releasing the lock,
  so Close's internal synchronisation/IO doesn't serialise unrelated
  callers on the global map lock.
- wdclient.tryConnectToMaster: only invalidate the cached request-path
  channel when the streaming call returned an error. On a healthy
  leader redirect (gprcErr == nil) the cached channel is still usable
  and invalidating it just causes a needless re-dial from concurrent
  callers.

* refactor(grpc): centralize peer-conn invalidation in streaming path

Previously every streaming caller duplicated the same
invalidate-cached-non-streaming-peer-conn wrapper around their
WithGrpcClient(true, ...) call. Move that logic into WithGrpcClient
itself: when the streaming fn returns an error, invalidate any cached
ClientConn for the same address. This removes six near-identical
call-site wrappers and gives every current and future streaming caller
the fix by default.

Also aligns the non-streaming branch with the new Invalidate helper's
lock discipline: delete the cache entry under grpcClientsLock, then
Close the ClientConn after releasing the lock.
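
For illustration only, the lock discipline and the centralized
streaming-path invalidation described above might be sketched roughly
as follows. This is not the SeaweedFS code: connCache, closer,
fakeConn, withStream and errStreamDied are hypothetical names, the
closer interface stands in for *grpc.ClientConn so the sketch needs
only the standard library, and it assumes the streaming call uses its
own uncached connection.

```go
package main

import (
	"errors"
	"sync"
)

// closer is a hypothetical stand-in for *grpc.ClientConn.
type closer interface{ Close() error }

// connCache is a hypothetical stand-in for the shared request-path cache.
type connCache struct {
	mu    sync.Mutex
	conns map[string]closer
	dial  func(address string) (closer, error)
}

func newConnCache(dial func(string) (closer, error)) *connCache {
	return &connCache{conns: make(map[string]closer), dial: dial}
}

// get returns the cached connection for address, dialing one if absent.
func (c *connCache) get(address string) (closer, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if conn, ok := c.conns[address]; ok {
		return conn, nil
	}
	conn, err := c.dial(address)
	if err != nil {
		return nil, err
	}
	c.conns[address] = conn
	return conn, nil
}

// invalidate removes the cache entry while holding the lock, then closes
// the connection after releasing it, so Close never blocks other callers.
func (c *connCache) invalidate(address string) {
	c.mu.Lock()
	conn, ok := c.conns[address]
	delete(c.conns, address)
	c.mu.Unlock()
	if ok {
		_ = conn.Close()
	}
}

// withStream runs fn on a dedicated, uncached connection (an assumption of
// this sketch). If fn fails, the cached request-path connection for the
// same address is invalidated so the next get dials fresh.
func (c *connCache) withStream(address string, fn func(closer) error) error {
	stream, err := c.dial(address)
	if err != nil {
		return err
	}
	defer stream.Close()
	if err := fn(stream); err != nil {
		c.invalidate(address)
		return err
	}
	return nil
}

// fakeConn records whether Close was called; for demonstration only.
type fakeConn struct{ closed bool }

func (f *fakeConn) Close() error { f.closed = true; return nil }

// errStreamDied is a hypothetical error representing a dead stream.
var errStreamDied = errors.New("stream died")
```

In this sketch a streaming fn that returns nil leaves the cached entry
untouched, mirroring the healthy leader redirect case, while any error
evicts the entry so unrelated callers re-dial on their next request.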