In topology-on-raft mode, the events "new node starts its group0 server"
and "new node is added to group0 configuration" are not synchronized
with each other. As a result, the cluster may start sending commands to
the new node before the node has started its server. This can lead to
harmless but ugly messages like:
    INFO 2023-09-27 15:42:42,611 [shard 0:stat] rpc - client 127.0.0.1:56352 msg_id 2: exception "Raft group b8542540-5d3b-11ee-99b8-1052801f2975 not found" in no_wait handler ignored
In order to solve this, the failure detector verb is extended to report
information about whether group0 is alive. The raft rpc layer will drop
messages to nodes whose group0 is not seen as alive.
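
The idea can be illustrated with a minimal, self-contained sketch. It is
not the code from this series: `ping_reply`, `rpc_layer`, `server_id` and
the method names are hypothetical stand-ins, and the split between
silently dropping one-way messages and throwing
`destination_not_alive_error` for two-way calls is an assumption made
for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a raft server identifier.
using server_id = std::uint64_t;

// Hypothetical ping reply, extended with a group0 liveness flag.
struct ping_reply {
    bool group0_alive;
};

// Raised for two-way calls whose destination is not seen as alive
// (assumed behavior, see the lead-in above).
struct destination_not_alive_error : std::runtime_error {
    explicit destination_not_alive_error(server_id id)
        : std::runtime_error("destination " + std::to_string(id) + " is not alive") {}
};

// Remembers the liveness most recently reported for each node.
class failure_detector {
    std::unordered_map<server_id, bool> _alive;
public:
    void on_ping_reply(server_id from, ping_reply r) { _alive[from] = r.group0_alive; }
    void on_ping_failure(server_id from) { _alive[from] = false; }

    // Nodes that never answered a ping are treated as not alive.
    bool is_alive(server_id id) const {
        auto it = _alive.find(id);
        return it != _alive.end() && it->second;
    }
};

// Hypothetical RPC layer that consults the failure detector before sending.
class rpc_layer {
    const failure_detector& _fd;
    std::vector<std::pair<server_id, std::string>> _sent;
public:
    explicit rpc_layer(const failure_detector& fd) : _fd(fd) {}

    // One-way message: dropped if the destination is not alive.
    // Returns true if the message was actually sent.
    bool send_one_way(server_id dst, std::string msg) {
        if (!_fd.is_alive(dst)) {
            return false;
        }
        _sent.emplace_back(dst, std::move(msg));
        return true;
    }

    // Two-way call: fails fast with an error instead of reaching a node
    // whose group0 server has not started yet.
    void send_two_way(server_id dst, std::string msg) {
        if (!_fd.is_alive(dst)) {
            throw destination_not_alive_error(dst);
        }
        _sent.emplace_back(dst, std::move(msg));
    }

    std::size_t sent_count() const { return _sent.size(); }
};
```

In this sketch a joining node receives nothing until its first ping
reply reports `group0_alive == true`, so the receiver never has to log a
"Raft group ... not found" error for a message that arrived too early.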
Tested by adding a delay before group0 is started on the joining node,
running all topology tests and grepping for the aforementioned log
messages.
Fixes: scylladb/scylladb#15853
Fixes: scylladb/scylladb#15167
Closes scylladb/scylladb#16071
* github.com:scylladb/scylladb:
raft: rpc: introduce destination_not_alive_error
raft: rpc: drop RPCs if the destination is not alive
raft: pass raft::failure_detector to raft_rpc
raft: transfer information about group0 liveness in direct_fd_ping
raft: add server::is_alive