scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Files

Kamil Braun 1eee349a17 test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove

The test would create a scenario where one node was down while the others
started the Raft upgrade procedure. The procedure would get stuck, but
it was possible to `removenode` the downed node using one of the alive
nodes, which would unblock the Raft upgrade procedure.

This worked because:
1. the upgrade procedure starts by ensuring that all peers can be
   contacted,
2. `removenode` starts by removing the node from the token ring.

After removing the node from the token ring, the upgrade procedure
becomes able to contact all peers (the peers set no longer contains the
down node). At the end, after removing the node from the token ring,
`removenode` would actually get stuck for a while, waiting for the
upgrade procedure to finish before removing the peer from group 0.
After the upgrade procedure finished, `removenode` would also finish.
(so: first the upgrade procedure waited for removenode, then removenode
waited for the upgrade procedure).

We want to modify the `removenode` procedure and include a new step
before removing the node from the token ring: making the node a
non-voter. The purpose is to improve the possible failure scenarios.
Previously, if the `removenode` procedure failed after removing the node
from the token ring but before removing it from group 0, the cluster
would contain a 'garbage' group 0 member which is a voter - reducing
group 0's availability. If the node is made a non-voter first, then this
failure will not be as big of a problem, because the leftover group 0
member will be a non-voter.

However, to correctly perform group 0 operations including making
someone a nonvoter, we must first wait for the Raft upgrade procedure to
finish (or at least wait until everyone joins group 0). Therefore by
including this 'make the node a non-voter' step at the beginning of
`removenode`, we make it impossible to remove a token ring member in the
middle of the upgrade procedure, on which the test case relied. The test
case would get stuck waiting for the `removenode` operation to finish,
which would never finish because it would wait for the upgrade procedure
to finish, which would not finish because of the dead peer.

We remove the test case; it was "lucky" to pass in the first place. We
have a dedicated mechanism for handling dead peers during Raft upgrade
procedure: the manual Raft group 0 RECOVERY procedure. There are other
test cases in this file which are using that procedure.

2023-01-17 12:28:00 +01:00