scylladb/service
Tomasz Grabiec 79ee38181c Merge 'storage_service: wait for normal state handlers earlier in the boot procedure' from Kamil Braun
The `wait_for_normal_state_handled_on_boot` function waits until
`handle_state_normal` finishes for the given set of nodes. It was used
in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states
of existing nodes in the cluster are processed by the joining node
before continuing the joining process. One reason for this is that at
the end of `handle_state_normal` the joining node might drop connections
to the NORMAL nodes in order to reestablish new connections using
correct encryption settings. In tests we observed that the connection
drop was happening in the middle of repair/streaming, causing
repair/streaming to abort.

Unfortunately, calling `wait_for_normal_state_handled_on_boot` in
`run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems.
Before either of these two functions, we create a new CDC generation and
write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`.
In tests, the connections were sometimes dropped while this write was
in flight. The write would then never reach the other nodes,
and the joining node would time out waiting for confirmations.

To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the
boot procedure, before the `make_new_generation` call, which performs the write.
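
The ordering problem can be illustrated with a small standalone sketch. This
is not ScyllaDB code: `boot_sim` and `simulate_boot` are hypothetical names
invented for illustration, and the model assumes only what is described
above, namely that a handler finishing while the write is in flight drops
the connection and loses the write.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical model of a joining node; not the real storage_service.
struct boot_sim {
    bool write_in_flight = false;
    bool write_lost = false;
    std::unordered_set<std::string> handled;  // nodes whose NORMAL state was processed

    // Models the end of handle_state_normal: connections to the node are
    // dropped and reestablished, which loses any write still in flight.
    void handle_state_normal(const std::string& node) {
        if (write_in_flight) {
            write_lost = true;
        }
        handled.insert(node);
    }

    // Models the start of the CDC generation write.
    void begin_write() {
        assert(!write_in_flight);
        write_in_flight = true;
    }

    // Returns true if the write was confirmed, false if it was lost.
    bool end_write() {
        write_in_flight = false;
        return !write_lost;
    }
};

// Runs the simulated boot. With wait_early == true, all NORMAL state
// handlers finish before the write starts (the order this change adopts).
// With wait_early == false, they finish while the write is in flight
// (the problematic order).
inline bool simulate_boot(const std::vector<std::string>& normal_nodes, bool wait_early) {
    boot_sim sim;
    auto run_handlers = [&] {
        for (const auto& node : normal_nodes) {
            sim.handle_state_normal(node);
        }
    };
    if (wait_early) {
        run_handlers();
    }
    sim.begin_write();
    if (!wait_early) {
        run_handlers();
    }
    return sim.end_write();
}
```

In this model, `simulate_boot({"n1", "n2"}, true)` reports a confirmed write,
while the same call with `false` reports a lost one.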

Fixes: #13302

Closes #13317

* github.com:scylladb/scylladb:
  storage_service: wait for normal state handlers earlier in the boot procedure
  storage_service: bootstrap: wait for normal tokens to arrive in all cases
  storage_service: extract get_nodes_to_sync_with helper
  storage_service: return unordered_set from get_ignore_dead_nodes_for_replace
2023-03-27 13:56:47 +02:00