Start with a cluster with Raft disabled, end up with a cluster that performs
schema operations using group 0.
Design doc:
https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/
(TODO: replace this with .md file - we can do it as a follow-up)
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from `system.peers`
table)
- enter `synchronize` upgrade state, in which group 0 operations are disabled
- wait until all members of group 0 entered `synchronize` state or some member
entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used for
schema operations.
With the procedure comes a recovery mode in case the upgrade procedure gets
stuck (and it may if we lose a node during recovery - the procedure, to
correctly establish a single group 0 cluster, requires contacting every node).
This recovery mode can also be used to recover clusters with group 0 already
established if they permanently lose a majority of nodes - killing two birds with
one stone. Details in the last commit message.
Read the design doc, then read the commits in topological order
for best reviewing experience.
---
I did some manual tests: upgrading a cluster, using the cluster to add nodes,
remove nodes (both with `decommission` and `removenode`), replacing nodes.
Performing recovery.
As a follow-up, we'll need to implement tests using the new framework (after
it's ready). It will be easy to test upgrades and recovery even with a single
Scylla version - we start with a cluster with the RAFT flag disabled, then
rolling-restart while enabling the flag (and recovery is done through simple
CQL statements).
Closes#10835
* github.com:scylladb/scylladb:
service/raft: raft_group0: implement upgrade procedure
service/raft: raft_group0: extract `tracker` from `persistent_discovery::run`
service/raft: raft_group0: introduce local loggers for group 0 and upgrade
service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb
service/raft: raft_group0_client: prepare for upgrade procedure
service/raft: introduce `group0_upgrade_state`
db: system_keyspace: introduce `load_peers`
idl-compiler: introduce cancellable verbs
message: messaging_service: cancellable version of `send_schema_check`