Files
scylladb/message
Tomasz Grabiec 50e8ec77c6 Merge 'Wait for other nodes to be UP and NORMAL on bootstrap right after enabling gossiping' from Kamil Braun
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
2023-06-28 20:40:03 +02:00
..