Files
scylladb/service
Kamil Braun 2b44631ded Merge 'storage_service: Make node operations safer by detecting asymmetric abort' from Tomasz Grabiec
This patch fixes a problem which affects decommission and removenode
which may lead to data consistency problems under conditions which
lead one of the nodes to unliaterally decide to abort the node
operation without the coordinator noticing.

If this happens during streaming, the node operation coordinator would
proceed to make a change in the gossiper, and only later dectect that
one of the nodes aborted during sending of decommission_done or
removenode_done command. That's too late, because the operation will
be finalized by all the nodes once gossip propagates.

It's unsafe to finalize the operation while another node aborted. The
other node reverted to the old topolgy, with which they were running
for some time, without considering the pending replica when handling
requests. As a result, we may end up with consistency issues. Writes
made by those coordinators may not be replicated to CL replicas in the
new topology. Streaming may have missed to replicate those writes
depending on timing.

It's possible that some node aborts but streaming succeeds if the
abort is not due to network problems, or if the network problems are
transient and/or localized and affect only heartbeats.

There is no way to revert after we commit the node operation to the
gossiper, so it's ok to close node_ops sessions before making the
change to the gossiper, and thus detect aborts and prevent later aborts
after the change in the gossiper is made. This is already done during
bootstrap (RBNO enabled) and replacenode. This patch canges removenode
to also take this approach by moving sending of remove_done earlier.

We cannot take this approach with decommission easily, because
decommission_done command includes a wait for the node to leave the
ring, which won't happen before the change to the gossiper is
made. Separating this from decommission_done would require protocol
changes. This patch adds a second-best solution, which is to check if
sessions are still there right before making a change to the gossiper,
leaving decommission_done where it was.

The race can still happen, but the time window is now much smaller.

The PR also lays down infrastructure which enables testing the scenarios. It makes node ops
watchdog periods configurable, and adds error injections.

Fixes #12989
Refs #12969

Closes #13028

* github.com:scylladb/scylladb:
  storage_service: node ops: Extract node_ops_insert() to reduce code duplication
  storage_service: Make node operations safer by detecting asymmetric abort
  storage_service: node ops: Add error injections
  service: node_ops: Make watchdog and heartbeat intervals configurable
2023-03-07 17:36:51 +01:00
..
2023-02-15 11:01:50 +02:00