In `storage_service::removenode`, in "Step 5", services which implement
`endpoint_lifecycle_subscriber` are first notified about the node
leaving the cluster, and only after that the gossiper state is updated
(comments added by me):
// This function indirectly notifies subscribers
ss.excise(std::move(tmp), endpoint);
// This function updates the gossiper state
ss._gossiper.advertise_token_removed(endpoint, host_id).get();
This order is confusing for those subscribers which expect the fact
that the node is leaving to be reflected in the gossiper state - more
specifically, for hints manager.
The hints manager has a function `can_send()` which determines whether
it is OK for it to try to send hints. More specifically, it looks at the
gossiper state to see if the destination node is ALIVE or if it has
left the ring. The first case is obvious as the destination node will
be able to receive the hints as writes, while the other means that the
hints will be sent with CL=ALL to its new replicas.
When a node leaves the cluster, all hint queues either to or from that
node enter the "drain" mode - the queue will attempt to send out all
hints and will drop those hints which failed to be sent. This mode is
triggered by a notification from the storage_service (hints manager is
a lifecycle subscriber).
The core drain logic for a queue looks as follows:
manager_logger.trace("Draining for {}: start", end_point_key());
set_draining();
send_hints_maybe();
_ep_manager.flush_current_hints().handle_exception([] (auto e) {
manager_logger.error("Failed to flush pending hints: {}. Ignoring...", e);
}).get();
send_hints_maybe();
manager_logger.trace("Draining for {}: end", end_point_key());
And `send_hints_maybe` contains the following loop:
while (replay_allowed() && have_segments() && can_send()) {
if (!send_one_file(*_segments_to_replay.begin())) {
break;
}
_segments_to_replay.pop_front();
++replayed_segments_count;
}
Coming back to `storage_service::removenode` - because of the order of
`excise` and `advertise_token_removed`, draining starts before the node
being removed is cleared from the gossiper state. As a result, the drain
logic may call `send_hints_maybe` twice without sending any hints - the
loop in that function stops immediately because `can_send()` is false:
the gossiper still reports the target node as dead rather than as having
left the ring. The logic expects `can_send` to be true at this point
because the node has left the ring.
This patch swaps the order of `excise` and `advertise_token_removed` in
`storage_service::removenode`: `excise` is now called after
`advertise_token_removed`. This ensures that the gossiper state is
updated before the lifecycle subscribers are notified, so the race
described above no longer happens - `can_send` is true when the hint
queues are drained.
The race described here was exposed by the following commit:
77a0f1a153

Fixes: #5087
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes #8284