In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:
1) replacing node does not respond echo message to avoid other nodes to
mark replacing node as alive
2) replacing node advertises hibernate state so other nodes knows
replacing node is replacing
3) replacing node responds echo message so other nodes can mark
replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) replacing node does not respond echo message
2) replacing node advertises prepare_replace state (Remove replacing
node from natural endpoint, but do not put in pending list yet)
3) replacing node responds echo message
4) replacing node advertises hibernate state (Put replacing node in
pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stage without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster reverts to a state before the replace operation
automatically in case of error. As a result, it solves when the
replacing node fails in the middle of the operation, the repacing
node will be in HIBERNATE status forever issue.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Fixes#8013Closes#8330
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for replace operation
gossip: Add advertise_to_nodes
gossip: Add helper to wait for a node to be up
gossip: Add is_normal_ring_member helper