diff --git a/docs/troubleshooting/handling-node-failures.rst b/docs/troubleshooting/handling-node-failures.rst
index 35a886ef47..7af6f7b661 100644
--- a/docs/troubleshooting/handling-node-failures.rst
+++ b/docs/troubleshooting/handling-node-failures.rst
@@ -78,16 +78,10 @@ You can follow the manual recovery procedure when:
    **irrecoverable** nodes. If possible, restart your nodes, and use the manual
    recovery procedure as a last resort.
 
-.. note::
+.. warning::
 
-   Before proceeding, make sure that the irrecoverable nodes are truly dead, and not,
-   for example, temporarily partitioned away due to a network failure. If it is
-   possible for the 'dead' nodes to come back to life, they might communicate and
-   interfere with the recovery procedure and cause unpredictable problems.
-
-   If you have no means of ensuring that these irrecoverable nodes won't come back
-   to life and communicate with the rest of the cluster, setup firewall rules or otherwise
-   isolate your alive nodes to reject any communication attempts from these dead nodes.
+   The manual recovery procedure is not supported :doc:`if tablets are enabled on any of your keyspaces `.
+   In such a case, you need to :doc:`restore from backup `.
 
 During the manual recovery procedure you'll enter a special ``RECOVERY`` mode,
 remove all faulty nodes (using the standard :doc:`node removal procedure `),
@@ -97,15 +91,26 @@ perform the Raft upgrade procedure again, initializing the Raft algorithm from s
 
 The manual recovery procedure is applicable both to clusters that were not running Raft
 in the past and then had Raft enabled, and to clusters that were bootstrapped using Raft.
 
-.. note::
+**Prerequisites**
 
-   Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while
-   some nodes are already dead may lead to unavailability of data queries (assuming that
-   you haven't lost it already). For example, if you're using the standard RF=3,
-   CL=QUORUM setup, and you're recovering from a stuck of upgrade procedure because one
-   of your nodes is dead, restarting another node will cause temporary data query
-   unavailability (until the node finishes restarting). Prepare your service for
-   downtime before proceeding.
+* Before proceeding, make sure that the irrecoverable nodes are truly dead, and not,
+  for example, temporarily partitioned away due to a network failure. If it is
+  possible for the 'dead' nodes to come back to life, they might communicate and
+  interfere with the recovery procedure and cause unpredictable problems.
+
+  If you have no means of ensuring that these irrecoverable nodes won't come back
+  to life and communicate with the rest of the cluster, set up firewall rules or otherwise
+  isolate your alive nodes to reject any communication attempts from these dead nodes.
+
+* Prepare your service for downtime before proceeding.
+  Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while
+  some nodes are already dead may lead to unavailability of data queries (assuming that
+  you haven't lost it already). For example, if you're using the standard RF=3,
+  CL=QUORUM setup, and you're recovering from a stuck upgrade procedure because one
+  of your nodes is dead, restarting another node will cause temporary data query
+  unavailability (until the node finishes restarting).
+
+**Procedure**
 
 #. Perform the following query on **every alive node** in the cluster, using e.g. ``cqlsh``:
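
The prerequisites in this change recommend firewalling off the irrecoverable nodes so they cannot come back and interfere with recovery. One possible sketch of that isolation step, using ``iptables`` (the docs do not prescribe a specific tool, and ``DEAD_NODE_IP`` is a placeholder, not a value from the source), run on every alive node:

```shell
# Placeholder address of an irrecoverable node -- substitute your own.
DEAD_NODE_IP=192.0.2.10

# Drop all traffic from the dead node, which covers gossip and
# inter-node RPC (port 7000, or 7001 with TLS), so it cannot
# communicate with the rest of the cluster during recovery.
sudo iptables -A INPUT -s "$DEAD_NODE_IP" -j DROP
```

You would remove the rule (``sudo iptables -D INPUT -s "$DEAD_NODE_IP" -j DROP``) only after the recovery procedure has completed and the node has been permanently decommissioned or reinstalled.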