Commit 0665d9c346 changed the gossiper
failure detector in the following way: when live endpoints change
and per-node failure detectors finish their loops, the main failure
detector calls gossiper::convict for those nodes which were alive when
the current iteration of the main FD started but now are not. This was
changed in order to make sure that nodes are marked as down, because
some other code in gossiper could concurrently remove nodes from
the live node lists without marking them properly.
This was committed around 3 years ago and the situation changed:
- After 75d1dd3a76
the `endpoint_state::_is_alive` field was removed and liveness
of a node is solely determined by its presence
in the `gossiper::_live_endpoints` field.
- Currently, all gossiper code which modifies `_live_endpoints`
takes care to trigger relevant callback. The only function which
modifies the field but does not trigger notifications
is `gossiper::evict_from_membership`, but it is either called
after `gossiper::remove_endpoint` which triggers callbacks
by itself, or when a node is already dead and there is no need
to trigger callbacks.
So, it looks like the reasons it was introduced for are not relevant
anymore. What's more important though is that it is involved in a bug
described in scylladb/scylladb#17515. In short, the following sequence
of events may happen:
1. Failure detector for some remote node X decides that it was dead
long enough and `convict`s it, causing live endpoints to be updated.
2. The gossiper main loop sends a successful echo to X and *decides*
to mark it as alive.
3. At the same time, failure detector for all nodes other than X finish
and main failure detector continues; it notices that node X is
not alive (because it was convicted in point 1.) and *decides*
to convict it.
4. Actions planned in 2 and 3 run one after another, i.e. node is first
marked as alive and then immediately as dead.
This causes `on_alive` callbacks to run first and then `on_dead`. The
second one is problematic as it closes RPC connections to node X - in
particular, if X is in the process of replacing another node with the
same IP then it may cause the replace operation to fail.
In order to simplify the code and fix the bug - remove the piece
of logic in question.
Fixes: scylladb/scylladb#17515Closesscylladb/scylladb#17754