9 Commits

Author SHA1 Message Date
Dawid Mędrek
b4898e50bf test/cluster/random_failures: Enable rf_rack_valid_keyspaces
Now that the test has been adjusted to work with the configuration
option, we enable it.
2025-10-28 14:17:09 +01:00
Dawid Mędrek
59b2a41c49 test/cluster/random_failures: Adjust to RF-rack-validity
We adjust the test to work with the configuration option
`rf_rack_valid_keyspaces` enabled. For that, we ensure that there is
always at least one node in each of the three racks. This way, all
keyspaces we create and manipulate will remain RF-rack-valid since they
all use RF=3.

------------------------------------------------------------------------

To achieve that, we only need to adjust the following events:

1. `init_tablet_transfer`
   The event creates a new keyspace and table and manually migrates
   a tablet belonging to it. As long as we make sure the migration occurs
   within the same rack, there will be no problem.

   Since RF == #racks, each rack will have exactly one tablet replica,
   so we can migrate the tablet to an arbitrary node in the same rack.

   Note that there must exist a node that's not a replica. If there weren't
   such a node, the test wouldn't have worked before this commit because
   it's not possible to migrate a tablet from one of its replica nodes to
   another. In other words, we have a guarantee that there are at least 4 nodes
   in the cluster when we try to migrate a tablet replica.

   That said, we check it anyway. If there's no viable node to migrate the
   tablet replica to, we log that information and do nothing. That should be
   an acceptable solution.

2. `add_new_node`
   As long as we add a node to an existing rack, there's no way to
   violate the invariant imposed by the configuration option, so we pick
   a random rack out of the existing three and create a node in it.

3. `decommission_node`
   We need to ensure that the node we'll be trying to decommission is
   not the only one in its rack.

   Following pretty much the same reasoning as in `init_tablet_transfer`,
   we conclude there must be a rack with at least two nodes in it. Otherwise
   we'd end up having to migrate a tablet from one replica node to another,
   which is not possible.

   What's more, decommissioning a node is not possible if any node in
   the cluster is dead, so we can assume that `manager.running_servers`
   returns the whole cluster.

4. `remove_node`
   The same as `decommission_node`. Note that although the node we choose to
   remove must first be stopped, no other node can be dead, so
   `manager.running_servers` must return the whole cluster.
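
The rack constraints behind the four events above can be illustrated with
a minimal sketch. It is not the test's actual code: the `Node` type and both
helper functions are hypothetical stand-ins, and the cluster is modelled
as a plain list.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a server record; not the test framework's type.
    node_id: int
    rack: str


def pick_migration_target(nodes, source, replicas):
    """Return a random non-replica node in the source's rack, or None."""
    candidates = [n for n in nodes if n.rack == source.rack and n not in replicas]
    return random.choice(candidates) if candidates else None


def can_leave(nodes, node):
    """A node may be decommissioned or removed only if its rack keeps another node."""
    return sum(1 for n in nodes if n.rack == node.rack) >= 2


# Three racks, RF=3, one extra node in rack "r2".
nodes = [Node(1, "r1"), Node(2, "r2"), Node(3, "r3"), Node(4, "r2")]
replicas = nodes[:3]

target = pick_migration_target(nodes, source=nodes[1], replicas=replicas)
no_target = pick_migration_target(nodes, source=nodes[0], replicas=replicas)
```

Here `target` is the extra node in "r2", while `no_target` is `None` because
rack "r1" has no spare node; that is the case where the event would only log
a message and do nothing.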

------------------------------------------------------------------------

There's one more important thing to note. The test may sometimes trigger
a sequence of events where a new node is started, but, due to an error
injection, its initialization is not completed. Among other things, the
node may NOT have a host ID recognized by the rest of the nodes in the
cluster, and operations like tablet migration will fail if they target
it.

Thankfully, there seems to be a way to avoid problems stemming from
that. When a new node is added to the cluster, it should appear at the
end of the list returned by `manager.running_servers`. This most likely
stems from how dictionaries work in Python:

"Keys and values are iterated over in insertion order."
-- https://docs.python.org/3/library/stdtypes.html#dict-views

and the fact that we keep track of running servers using a dictionary.
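
The ordering guarantee can be checked with a plain dictionary. The keys and
values below are made up; only the insertion-order behaviour of `dict` is
the point.

```python
# Hypothetical bookkeeping: server ID -> rack name.
running = {}
running[1] = "rack1"
running[2] = "rack2"
running[3] = "rack3"

# A server stops and a new one starts.
del running[2]
running[4] = "rack2"

# Updating an existing key does not move it.
running[1] = "rack1"

order = list(running)  # [1, 3, 4]
newest = order[-1]     # 4, the most recently added server
```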

Furthermore, we rely on the assumption that the test currently works
correctly.

Assume, to the contrary, that among the nodes taking part in the operations
listed above, there is at most one node per rack that has its host ID recognized
by the rest of the cluster. Note that only those nodes can store any tablets.
Let's refer to the set of those nodes as X.

Assume that we're dealing with tablet migration, decommissioning, or removing
a node. Since those operations involve tablet migration, at least one tablet
will need to be migrated from the node in question to another node in X.
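However, since X consists of at most three nodes, and one of them is losing
its tablet, there is no viable target for the tablet, so the operation fails.
That contradicts the assumption that the test currently works, so some rack
must contain at least two nodes recognized by the rest of the cluster.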

Using those assumptions, we designed an auxiliary function,
`select_viable_rack`, to choose the rack from which we then pick nodes
for the topological operations. It's simple: we just find the first
rack in the list that has at least two nodes in it. That should ensure that we
perform an operation that doesn't lead to any unforeseen disaster.
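
A minimal sketch of that selection rule follows. The real function's
signature is not shown here, so the name `select_viable_rack_sketch` and
its input (a list with one rack name per node, in the order the nodes are
listed) are assumptions.

```python
from collections import Counter


def select_viable_rack_sketch(node_racks):
    """Return the first rack, in list order, that holds at least two nodes.

    `node_racks` is assumed to contain one rack name per node.
    Returns None when every rack has a single node.
    """
    counts = Counter(node_racks)
    for rack in node_racks:
        if counts[rack] >= 2:
            return rack
    return None


viable = select_viable_rack_sketch(["r1", "r2", "r3", "r2"])  # "r2"
none_viable = select_viable_rack_sketch(["r1", "r2", "r3"])   # None
```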

------------------------------------------------------------------------

Since the test effectively becomes more complex due to more care for keeping
the topology of the cluster valid, we extend the log messages to make them
more helpful when debugging a failure.
2025-10-28 14:15:57 +01:00
Emil Maskovsky
87bd328873 group0: remove obsolete "stop_before_becoming_raft_voter" error injection
The Raft topology workflow was changed by the limited voters feature:
nodes no longer request votership themselves. As a result, the
"stop_before_becoming_raft_voter" error injection is now obsolete and
has been removed.

Fixes: scylladb/scylladb#23418
2025-09-16 18:24:27 +02:00
Emil Maskovsky
0453052d66 test/random_failures: preserve test repeatability when removing error injections
The order of entries in the ERROR_INJECTIONS list determines test
repeatability for a given random seed.

To allow removing error injections without affecting the order of the
remaining ones, removed injections are now renamed with a "REMOVED_"
prefix instead of being deleted.

This ensures they are ignored by the tests, while the sequence of active
injections—and thus test reproducibility—remains unchanged.
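
The effect can be sketched as follows. Only the injection name
`stop_before_becoming_raft_voter` and the "REMOVED_" prefix come from this
history; the other names are invented, and filtering after the draw is an
assumption about how the renamed entries get ignored.

```python
import random

REMOVED_PREFIX = "REMOVED_"

# Hypothetical lists; the real ERROR_INJECTIONS list is much longer.
BEFORE = [
    "injection_a",
    "stop_before_becoming_raft_voter",
    "injection_b",
    "injection_c",
]
RENAMED = [
    "injection_a",
    REMOVED_PREFIX + "stop_before_becoming_raft_voter",
    "injection_b",
    "injection_c",
]


def draw(injections, seed, count):
    """Draw `count` entries with a seeded RNG, then drop the removed ones."""
    rng = random.Random(seed)
    picks = rng.choices(injections, k=count)
    return [name for name in picks if not name.startswith(REMOVED_PREFIX)]


before = draw(BEFORE, seed=1234, count=10)
after = draw(RENAMED, seed=1234, count=10)
# `after` equals `before` with the removed injection filtered out, because
# the list length, and therefore every drawn index, is unchanged.
```

Deleting the entry instead would shorten the list, so the same seed could
map to different injections.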
2025-09-16 18:22:45 +02:00
Evgeniy Naydanov
f6e3fdd778 test.py: rework log_browsing for dtest migration
Rework the `ScyllaLogFile.wait_for()` method to make it easier
to add the required methods to the ScyllaNode class of the ccm-like shim.

Also, add the `ScyllaLogFile.grep_for_errors()` method and
rework `ScyllaLogFile.grep()`.
2025-05-19 11:50:55 +00:00
Dawid Mędrek
c4b32c38a3 test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
2025-05-10 16:30:49 +02:00
Emil Maskovsky
00794af94d raft/test: disable the stop_before_becoming_raft_voter test
The workflow of becoming a voter changes with the "limited voters"
feature: a node no longer becomes a voter on its own; instead,
votership is managed by the topology coordinator. This breaks the
`stop_before_becoming_raft_voter` test, as that injection
relies on the old behavior.

We will disable the test for this particular case for now and address
either fixing or completely removing the test in a follow-up task.

Refs: scylladb/scylladb#23418
2025-04-07 12:23:25 +02:00
Evgeniy Naydanov
cac0257914 test.py: random_failures: make it play well with xdist
Pass random seed across xdist workers using env variable.
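
One way this could look is sketched below, assuming worker processes inherit
the environment of the process that starts them. The variable name
`EXAMPLE_RANDOM_SEED` and the helper `resolve_seed` are hypothetical and are
not taken from the test suite.

```python
import os
import random

SEED_ENV_VAR = "EXAMPLE_RANDOM_SEED"  # hypothetical variable name


def resolve_seed(environ=os.environ):
    """Reuse the seed stored in the environment, or create and store one."""
    value = environ.get(SEED_ENV_VAR)
    if value is None:
        # First caller: generate a seed and publish it for later readers.
        value = str(random.randrange(2**32))
        environ[SEED_ENV_VAR] = value
    return int(value)


# A plain dict stands in for the process environment in this example.
env = {}
first = resolve_seed(env)
second = resolve_seed(env)  # same value as `first`
```

Every process that reads the variable then seeds its generator with the same
value, so a failing run can be repeated with that seed.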
2025-03-30 03:19:30 +00:00
Artsiom Mishuta
d1198f8318 test.py: rename topology_custom folder to cluster
Rename the topology_custom folder to cluster,
as it contains more than just topology test cases.
2025-03-04 10:32:44 +01:00