mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-28 20:27:03 +00:00
The test would create a scenario where one node was down while the others started the Raft upgrade procedure. The procedure would get stuck, but it was possible to `removenode` the downed node using one of the alive nodes, which would unblock the Raft upgrade procedure. This worked because: 1. the upgrade procedure starts by ensuring that all peers can be contacted, 2. `removenode` starts by removing the node from the token ring. After removing the node from the token ring, the upgrade procedure becomes able to contact all peers (the peers set no longer contains the down node). At the end, after removing the node from the token ring, `removenode` would actually get stuck for a while, waiting for the upgrade procedure to finish before removing the peer from group 0. After the upgrade procedure finished, `removenode` would also finish. (so: first the upgrade procedure waited for removenode, then removenode waited for the upgrade procedure). We want to modify the `removenode` procedure and include a new step before removing the node from the token ring: making the node a non-voter. The purpose is to improve the possible failure scenarios. Previously, if the `removenode` procedure failed after removing the node from the token ring but before removing it from group 0, the cluster would contain a 'garbage' group 0 member which is a voter - reducing group 0's availability. If the node is made a non-voter first, then this failure will not be as big of a problem, because the leftover group 0 member will be a non-voter. However, to correctly perform group 0 operations including making someone a nonvoter, we must first wait for the Raft upgrade procedure to finish (or at least wait until everyone joins group 0). Therefore by including this 'make the node a non-voter' step at the beginning of `removenode`, we make it impossible to remove a token ring member in the middle of the upgrade procedure, on which the test case relied. The test case would get stuck waiting for the `removenode` operation to finish, which would never finish because it would wait for the upgrade procedure to finish, which would not finish because of the dead peer. We remove the test case; it was "lucky" to pass in the first place. We have a dedicated mechanism for handling dead peers during Raft upgrade procedure: the manual Raft group 0 RECOVERY procedure. There are other test cases in this file which are using that procedure.