In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.
This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.
However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.
After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:
```
throw std::runtime_error(
"Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
"If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```
failure_detector_loop_for_node may be started on a shard before id->ip
mapping is available there. Currently the code treats missing mapping
as an internal error, but it uses its result for debug output only, so
lets relax the code to not assume the mapping is available.
Fixes#23407Closesscylladb/scylladb#24614
Currently send_gossip_echo has a 22 seconds timeout
during which _abort_source is ignored.
Mark the verb as cancellable so it can be canceled
on shutdown / abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Aborting the failure detector happens normally
when the node shuts down.
There's no need to log anything about it,
as long as we abort the function cleanly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In cases where two entries have the same ip address send information
only for the newest one. Now we send both which make the receiver use
one of them at random and it may be outdated one (though it should only
cause more data than needed to be requested).
Since the gossiper works on host ids now it is incorrect to leave this
function to work on ip. It makes it impossible to delete outdated entry
since the "gossiper.get_host_id(endpoint) != id" check will always be
false for such entries (get_host_id() always returns most up -to-date
mapping.
Now that the gossiper map is id based there can be a situation where two
entries have the same ip, Shadow round should send the newest one in
this cased. The patch makes it so.
Fixes: #23553
This patch changes gossiper to index nodes by host ids instead of ips.
The main data structure that changes is _endpoint_state_map, but this
results in a lot of changes since everything that uses the map directly
or indirectly has to be changed. The big victim of this outside of the
gossiper itself is topology over gossiper code. It works on IPs and
assumes the gossiper does the same and both need to be changed together.
Changes to other subsystems are much smaller since they already mostly
work on host ids anyway.
Store endpoint's IP in the endpoint state. Currently it is stored as a key
in gossiper's endpoint map, but we are going to change that. The new filed
is not serialized when endpoint state is sent over rpc, so it is set by
the rpc handler from the value in the map that is in the rpc message. This
map will not be changed to be host id based to not break interoperability.
We assume that all endpoint states have HOST_ID set or the host id is
available locally, but the assassinate code injects a state without
HOST_ID for not existing endpoint violating this assumption.
This patch ensures that members of the new group 0 can gossip with
members of the old group 0 during rolling restart in the Raft-based
recovery procedure. Without this change, restarted nodes (members of
the new group 0) wouldn't be marked as UP by other nodes (members of
the old group 0), which would decrease availability.
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
Before calling force_remove_endpoint (which works on ip) the code checks
that the ip maps to the correct id (not not remove a new node that
inherited this ip by mistake). Move the check to the function itself.
A node may change its IP but some other node in the cluster may still
try to ping it using an old IP because it may receive an outdated gossiper
entry with the old IP. Do not send echo message to the old IP. It will
cause a misusing UP message with old address to be printed.
The following metrics will be marked with basic_level label:
scylla_gossip_heart_beat
scylla_gossip_live
scylla_gossip_unreachable
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.