Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:
* Using `pytest.mark.prepare_3_racks_cluster` instead of
`pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
`ManagerClient::servers_add()`.
In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
The test fails sporadically with:
cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
That's becase a server is stopped in the middle of the workload.
The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.
Fixes#20492Closesscylladb/scylladb#23451
With capacity-aware balancing, if we're missing capacity for a normal
node, we won't be able to proceed with tablet drain. Consider the
following scenario:
1. Nodes: A, B
2. refresh stats with A and B
3. Add node C
4. Node B goes down
5. removenode B starts
6. stats refreshing fails because B is down
If we don't have capacity stats for node C, load balancer cannot make
decisions and removenode is blocked indefinitely. A reproducer is
added in this patch.
To alleviate that, we allow capacity stats to be collected for nodes
which are reachable, we just don't update the table size part.
To keep table stats monotonic, we cache previous results per node, so
even if it's unreachable now, we use its last reported sizes. It's
still more accurate than not refreshing stats at all. A node can be
down for a long period, and other replicas can grow in size. It's not
perfect, because the stale node can skew the stats in its direction,
but ignoring it completely has its pitfalls too. Better solution is
left for later.