From cfa3cd4c949cb9058b88edef5667d51935a6746e Mon Sep 17 00:00:00 2001
From: Anna Stuchlik
Date: Tue, 28 May 2024 14:13:58 +0200
Subject: [PATCH] doc: add the tablet limitation to the manual recovery procedure

This commit adds the information that the manual recovery procedure
is not supported if tablets are enabled.

In addition, the content in the Manual Recovery Procedure is reorganized
by adding the Prerequisites and Procedure subsections - in this way,
we can limit the number of Note and Warning boxes that made the page
hard to follow.

Fixes https://github.com/scylladb/scylladb/issues/18895

Closes scylladb/scylladb#18935
---
 .../handling-node-failures.rst | 39 +++++++++++-------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/docs/troubleshooting/handling-node-failures.rst b/docs/troubleshooting/handling-node-failures.rst
index 35a886ef47..7af6f7b661 100644
--- a/docs/troubleshooting/handling-node-failures.rst
+++ b/docs/troubleshooting/handling-node-failures.rst
@@ -78,16 +78,10 @@ You can follow the manual recovery procedure when:
    **irrecoverable** nodes. If possible, restart your nodes, and use the manual
    recovery procedure as a last resort.
 
-.. note::
+.. warning::
 
-   Before proceeding, make sure that the irrecoverable nodes are truly dead, and not,
-   for example, temporarily partitioned away due to a network failure. If it is
-   possible for the 'dead' nodes to come back to life, they might communicate and
-   interfere with the recovery procedure and cause unpredictable problems.
-
-   If you have no means of ensuring that these irrecoverable nodes won't come back
-   to life and communicate with the rest of the cluster, setup firewall rules or otherwise
-   isolate your alive nodes to reject any communication attempts from these dead nodes.
+   The manual recovery procedure is not supported :doc:`if tablets are enabled on any of your keyspaces `.
+   In such a case, you need to :doc:`restore from backup `.

 During the manual recovery procedure you'll enter a special ``RECOVERY`` mode, remove
 all faulty nodes (using the standard :doc:`node removal procedure `),
@@ -97,15 +91,26 @@ perform the Raft upgrade procedure again, initializing the Raft algorithm from s
 The manual recovery procedure is applicable both to clusters that were not running Raft
 in the past and then had Raft enabled, and to clusters that were bootstrapped using Raft.
 
-.. note::
+**Prerequisites**
 
-   Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while
-   some nodes are already dead may lead to unavailability of data queries (assuming that
-   you haven't lost it already). For example, if you're using the standard RF=3,
-   CL=QUORUM setup, and you're recovering from a stuck of upgrade procedure because one
-   of your nodes is dead, restarting another node will cause temporary data query
-   unavailability (until the node finishes restarting). Prepare your service for
-   downtime before proceeding.
+* Before proceeding, make sure that the irrecoverable nodes are truly dead, and not,
+  for example, temporarily partitioned away due to a network failure. If it is
+  possible for the 'dead' nodes to come back to life, they might communicate and
+  interfere with the recovery procedure and cause unpredictable problems.
+
+  If you have no means of ensuring that these irrecoverable nodes won't come back
+  to life and communicate with the rest of the cluster, setup firewall rules or otherwise
+  isolate your alive nodes to reject any communication attempts from these dead nodes.
+
+* Prepare your service for downtime before proceeding.
+  Entering ``RECOVERY`` mode requires a node restart. Restarting an additional node while
+  some nodes are already dead may lead to unavailability of data queries (assuming that
+  you haven't lost it already). For example, if you're using the standard RF=3,
+  CL=QUORUM setup, and you're recovering from a stuck upgrade procedure because one
+  of your nodes is dead, restarting another node will cause temporary data query
+  unavailability (until the node finishes restarting).
+
+**Procedure**
 
 #. Perform the following query on **every alive node** in the cluster, using e.g. ``cqlsh``: