In this commit we enhance test_replace_reuse_ip
to reproduce #17421. We create a test table and run
insert queries on it while the first node is
being replaced. In this form the test fails
without the fix from the previous commit. Some
insert requests fail with [Unavailable exception]
"Cannot achieve consistency level for cl QUORUM...".
The replaced node transitions to the LEFT state, and
we used to remove the IPs of such nodes from the gossiper.
If the replacing node reuses the same IP, this caused the IP of the
new node to be removed from the gossiper.
This problem was fixed by #16820; this commit
adds a regression test for it.

Closes #15967

In this commit we modify the existing
test_replace_different_ip. We add a check that the old
IP is in neither the alive nor the down list, which
means it has been completely wiped from the gossiper. This test fails
without the force_remove_endpoint fix from
a previous commit. We also check that the state of
the local system.peers table is correct.
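
The "not in alive or down lists" check can be sketched as a small predicate. This is only an illustration: the helper name ip_wiped_from_gossiper and the sample addresses are hypothetical, and how the test actually obtains the two lists from a node is not shown here.

```python
from typing import Iterable


def ip_wiped_from_gossiper(old_ip: str,
                           alive: Iterable[str],
                           down: Iterable[str]) -> bool:
    """Return True if old_ip appears in neither endpoint list.

    Hypothetical helper: `alive` and `down` stand for the alive and down
    endpoint lists reported by some running node.
    """
    return old_ip not in set(alive) and old_ip not in set(down)


# Sample addresses are made up for the illustration.
wiped = ip_wiped_from_gossiper("127.0.0.1", ["127.0.0.2", "127.0.0.4"], [])
still_known = ip_wiped_from_gossiper("127.0.0.1", ["127.0.0.2"], ["127.0.0.1"])
```

An IP that still shows up in the down list is treated as not wiped, since the gossiper still tracks it.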

ManagerClient.servers_add can be used in every test that uses
consistent topology changes. In such tests, we replace each series of
server_add calls with a single servers_add call to make the
tests faster and simplify their code. Additionally, these
servers_add calls test concurrent bootstraps for free.
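
The difference between the two call patterns can be sketched with a stand-in object. FakeManager below is hypothetical and is not the real ManagerClient: it only models how many bootstraps are in flight at once, assuming that a single servers_add call starts them concurrently, as described above.

```python
import asyncio
from typing import List


class FakeManager:
    """Hypothetical stand-in for ManagerClient; models call concurrency only."""

    def __init__(self) -> None:
        self.next_id = 1
        self.in_flight = 0
        self.max_in_flight = 0

    async def server_add(self) -> int:
        # Track how many simulated bootstraps overlap.
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # pretend the node is bootstrapping
        self.in_flight -= 1
        server_id = self.next_id
        self.next_id += 1
        return server_id

    async def servers_add(self, servers_num: int) -> List[int]:
        # Assumption for this sketch: all bootstraps start concurrently.
        coros = (self.server_add() for _ in range(servers_num))
        return list(await asyncio.gather(*coros))


async def add_sequentially(manager: FakeManager, n: int) -> List[int]:
    # Old pattern: several server_add calls, one after another.
    return [await manager.server_add() for _ in range(n)]


async def add_concurrently(manager: FakeManager, n: int) -> List[int]:
    # New pattern: one servers_add call.
    return await manager.servers_add(n)
```

With three servers, the sequential pattern never has more than one bootstrap in flight, while the single servers_add call has all three in flight at once.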

In one of the previous commits, we made
ManagerClient.server_add wait until all running nodes see the node
being replaced as dead. Unfortunately, the waiting time can be
around 20 s if we stop the node being replaced ungracefully, because 20 s
is the default value of the failure detector timeout.
We don't want to slow down the replace operations this much for no
good reason. We could use server_stop_gracefully instead of
server_stop everywhere, but we should have at least a few replace
tests with server_stop. For now, test_replace and
test_raft_ignore_nodes will be these tests. To keep them reasonably
fast, we decrease the failure_detector_timeout_in_ms value on all
initial servers.
We also skip test_replace in debug mode to avoid flakiness caused by the
low failure_detector_timeout_in_ms value (test_raft_ignore_nodes is
already skipped).

In the following commit, we make all servers in test_replace use
failure-detector-timeout-in-ms = 2000. Therefore, we need
test_replace to be in a suite with initial_size equal to 0.