This PR introduces the new Raft-based recovery procedure for group 0
majority loss.
The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.
The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.
We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.
The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.
For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.
The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.
Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from `system.raft`,
`system.raft_snaphots`, and `system.raft_snapshot_config`.
In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.
This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.
Fixesscylladb/scylladb#20657Closesscylladb/scylladb#22286
* github.com:scylladb/scylladb:
test: mark tests with the gossip-based recovery procedure
test: add tests for the Raft-based recovery procedure
test: topology: util: fix the tokens consistency check for left nodes
test: topology: util: extend start_writes
gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
raft_group0: modify_raft_voter_status: do not add new members
treewide: allow recreating group 0 in the Raft-based recovery procedure