diff --git a/docs/dev/topology-over-raft.md b/docs/dev/topology-over-raft.md
index a3c833c8af..ad2c8364a9 100644
--- a/docs/dev/topology-over-raft.md
+++ b/docs/dev/topology-over-raft.md
@@ -878,20 +878,87 @@ topology coordinator fiber and coordinates the remaining steps:
 If a disaster happens and a majority of nodes are lost, changes to the group 0
 state are no longer possible and a manual recovery procedure needs to be performed.
 
-Our current procedure starts by switching all nodes to a special "recovery" mode
-in which nodes do not use raft at all. In this mode, dead nodes are supposed
-to be removed from the cluster via `nodetool removenode`. After all dead nodes
-are removed, state related to group 0 is deleted and nodes are restarted in
-regular mode, allowing the cluster to re-form group 0.
-Topology on raft fits into this procedure in the following way:
+## The procedure
 
-- When nodes are restarted in recovery mode, they revert to gossiper-based
-  operations. This allows to perform `nodetool removenode` without having
-  a majority of nodes. In this mode, `system.topology` is *not* updated, so
-  it becomes outdated at the end.
-- Before disabling recovery mode on the nodes, the `system.topology` table
-  needs to be truncated on all nodes. This will cause nodes to revert to
-  legacy topology operations after exiting recovery mode.
-- After re-forming group 0, the cluster needs to be upgraded again to raft
-  topology by the administrator.
+Our current procedure starts by removing the persistent group 0 ID and the group 0
+discovery state on all live nodes. This ensures that live nodes will try to join
+a new group 0 on their next restart.
+
+One of the live nodes has to create the new group 0, and not every node is a safe
+choice: only a live node with the highest `commit_index` is safe to choose (see
+[this section](#choosing-the-recovery-leader) for a detailed explanation). We call
+the chosen node the *recovery leader*.
+
+Once the recovery leader is chosen, all live nodes can join the new group 0 during
+a rolling restart. Nodes learn about the recovery leader through the
+`recovery_leader` config option. The recovery leader must be restarted first so
+that it creates the new group 0 before the other nodes try to join it.
+
+After all live nodes have been restarted, dead nodes can be removed via
+`nodetool removenode` or replaced.
+
+Finally, the persisted internal state of the old group 0 can be cleaned up.
+
+## Topology coordinator during recovery
+
+After joining the new group 0 during the procedure, live nodes don't execute any
+"special recovery code" related to topology.
+
+In particular, the recovery leader starts the topology coordinator fiber as usual.
+This fiber is designed to ensure that a started topology operation never hangs
+(it either succeeds or is rolled back) regardless of the conditions. So, if the
+majority was lost in the middle of some work done by the topology coordinator, the
+new topology coordinator (running on the recovery leader) will finish this work.
+The work will usually fail and be rolled back, e.g., because
+`global_token_metadata_barrier` fails after a global topology command sent to
+a dead normal node fails.
+
+Note that this behavior is necessary to ensure that the new topology coordinator
+will eventually be able to start handling the topology requests to remove/replace
+dead nodes. Those requests will succeed thanks to the `ignore_dead_nodes` and
+`ignore_dead_nodes_for_replace` options.
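+
+The ordering constraints of the recovery procedure described above (in "The
+procedure") can be summarized with a short sketch. It is purely illustrative:
+`clear_group0_state`, `commit_index_of`, `set_recovery_leader_config`, `restart`,
+`remove_or_replace`, and `cleanup_old_group0_state` are hypothetical operator-side
+helpers, not real APIs.
+
+```python
+def recover(live_nodes, dead_nodes,
+            clear_group0_state, commit_index_of, set_recovery_leader_config,
+            restart, remove_or_replace, cleanup_old_group0_state):
+    # 1. Remove the persistent group 0 ID and discovery state on all live nodes,
+    #    so they will try to join a new group 0 on their next restart.
+    for node in live_nodes:
+        clear_group0_state(node)
+
+    # 2. The recovery leader must be a live node with the highest commit_index.
+    leader = max(live_nodes, key=commit_index_of)
+
+    # 3. Every live node learns about the leader via the `recovery_leader` config
+    #    option; the leader is restarted first so that the new group 0 exists
+    #    before the other nodes try to join it during the rolling restart.
+    for node in live_nodes:
+        set_recovery_leader_config(node, leader)
+    restart(leader)
+    for node in live_nodes:
+        if node is not leader:
+            restart(node)
+
+    # 4. Only after all live nodes have rejoined are dead nodes removed or
+    #    replaced (these requests succeed thanks to the ignore-dead-nodes options).
+    for node in dead_nodes:
+        remove_or_replace(node)
+
+    # 5. Finally, clean up the persisted internal state of the old group 0.
+    for node in live_nodes:
+        cleanup_old_group0_state(node)
+```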
+
+## Gossip during recovery
+
+A node always includes its group 0 ID in `gossip_digest_syn`, and the receiving node
+rejects the message if the ID differs from its local ID. However, nodes can
+temporarily belong to two different group 0s during the recovery procedure. To keep
+gossip working, we've decided to additionally include the local `recovery_leader`
+value in `gossip_digest_syn`. Nodes ignore a group 0 ID mismatch if the sender or
+the receiver has a non-empty `recovery_leader` (usually both do).
+
+## Choosing the recovery leader
+
+The group 0 state persisted on the recovery leader becomes the initial state of
+other nodes that join the new group 0 (which happens through the Raft snapshot
+transfer), since the Raft state machine must be consistent from the start.
+
+When a disaster happens, live nodes can have different commit indexes, and the nodes
+that are behind have no way of catching up without a majority. Imagine there are two
+live nodes, node A and node B; node A has `commit_index`=10, and node B has
+`commit_index`=11. Also, assume that the log entry with index 11 is a schema change
+that adds a new column to a table. Node B could have already handled some replica
+writes to the new column. If node A became the recovery leader and node B joined the
+new group 0, node B would receive a snapshot that regresses its schema version. Node
+B could end up in an inconsistent state, with data in a column that doesn't exist
+according to group 0. Hence, node B must be the recovery leader.
+
+## Loss of committed entries
+
+It can happen that a group 0 entry has been committed by a majority consisting only
+of nodes that are now dead. Then, no matter which recovery leader we choose, it won't
+have this entry. This is fine, assuming that the following group 0 causality property
+holds on all live nodes: any persisted effect on a node's state is written only after
+the group 0 state it depends on has already been persisted.
+
+For example, the above property holds for schema changes and writes because
+a replica persists a write only after applying the group 0 entry with the latest
+schema, which is ensured by a read barrier.
+
+It's critical for recovery safety to ensure that no subsystem breaks group 0
+causality. Fortunately, this property is natural and not very limiting.
+
+Losing a committed entry can be observed by external systems. For example, the latest
+schema version in the cluster can go back in time from the driver's perspective. This
+is outside the scope of the recovery procedure, though, and it shouldn't cause
+problems in practice.
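+
+To make the causality property above concrete, here is a minimal, self-contained
+sketch of the write path it implies. All names in it (`GroupZeroLog`, `Replica`,
+`apply_group0_up_to`, `handle_write`) are hypothetical and only illustrate the
+ordering; they are not real ScyllaDB APIs.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class GroupZeroLog:
+    """Committed group 0 entries, modeled as schema version strings."""
+    entries: list = field(default_factory=list)
+
+
+@dataclass
+class Replica:
+    log: GroupZeroLog
+    applied_index: int = 0      # group 0 entries applied (and persisted) locally
+    schema_version: str = "initial"
+    data: list = field(default_factory=list)    # persisted (schema, value) pairs
+
+    def apply_group0_up_to(self, index):
+        # Read barrier: persist the group 0 state up to `index` first.
+        while self.applied_index < index:
+            self.schema_version = self.log.entries[self.applied_index]
+            self.applied_index += 1
+
+    def handle_write(self, schema_entry_index, value):
+        # The write depends on the schema introduced by the group 0 entry at
+        # `schema_entry_index`, so that entry must be applied before the write
+        # is persisted. A recovery snapshot taken from this node can therefore
+        # never contain the data without the schema it depends on.
+        self.apply_group0_up_to(schema_entry_index)
+        self.data.append((self.schema_version, value))
+
+
+log = GroupZeroLog(entries=["v1: add column c"])
+replica = Replica(log=log)
+replica.handle_write(schema_entry_index=1, value="write to column c")
+assert replica.applied_index == 1 and replica.schema_version == "v1: add column c"
+```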