scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Dawid Mędrek	b4898e50bf	test/cluster/random_failures: Enable rf_rack_valid_keyspaces Now that the test has been adjusted to work with the configuration option, we enable it.	2025-10-28 14:17:09 +01:00
Dawid Mędrek	59b2a41c49	test/cluster/random_failures: Adjust to RF-rack-validity We adjust the test to work with the configuration option `rf_rack_valid_keyspaces` enabled. For that, we ensure that there is always at least one node in each of the three racks. This way, all keyspaces we create and manipulate will remain RF-rack-valid since they all use RF=3. ------------------------------------------------------------------------ To achieve that, we only need to adjust the following events: 1. `init_tablet_transfer` The event creates a new keyspace and table and manually migrates a tablet belonging to it. As long as we make sure the migration occurs within the same rack, there will be no problem. Since RF == #racks, each rack will have exactly one tablet replica, so we can migrate the tablet to an arbitrary node in the same rack. Note that there must exist a node that's not a replica. If there weren't such a node, the test wouldn't have worked before this commit because it's not possible to migrate a tablet from one node being its replica to another. In other words, we have a guarantee that there are at least 4 nodes in the cluster when we try to migrate a tablet replica. That said, we check it anyway. If there's no viable node to migrate the tablet replica to, we log that information and do nothing. That should be an acceptable solution. 2. `add_new_node` As long as we add a node to an existing rack, there's no way to violate the invariant imposed by the configuration option, so we pick a random rack out of the existing three and create a node in it. 3. `decommission_node` We need to ensure that the node we'll be trying to decommission is not the only one in its rack. Following pretty much the same reasoning as in `init_tablet_transfer`, we conclude there must be a rack with at least two nodes in it. Otherwise we'd end up having to migrate a tablet from one replica node to another, which is not possible. 
What's more, decommissioning a node is not possible if any node in the cluster is dead, so we can assume that `manager.running_servers` returns the whole cluster. 4. `remove_node` The same as `decommission_node`. Note that although the node we choose to remove must first be stopped, no other node can be dead, so the whole cluster must be returned by `manager.running_servers`. ------------------------------------------------------------------------ There's one more important thing to note. The test may sometimes trigger a sequence of events where a new node is started, but, due to an error injection, its initialization is not completed. Among other things, the node may NOT have a host ID recognized by the rest of the nodes in the cluster, and operations like tablet migration will fail if they target it. Thankfully, there seems to be a way to avoid problems stemming from that. When a new node is added to the cluster, it should appear at the end of the list returned by `manager.running_servers`. This most likely stems from how dictionaries work in Python: "Keys and values are iterated over in insertion order." -- https://docs.python.org/3/library/stdtypes.html#dict-views and the fact that we keep track of running servers using a dictionary. Furthermore, we rely on the assumption that the test currently works correctly. Assume, to the contrary, that among the nodes taking part in the operations listed above, there is at most one node per rack that has its host ID recognized by the rest of the cluster. Note that only those nodes can store any tablets. Let's refer to the set of those nodes as X. Assume that we're dealing with tablet migration, decommissioning, or removing a node. Since those operations involve tablet migration, at least one tablet will need to be migrated from the node in question to another node in X. 
However, since X consists of at most three nodes, and one of them is losing its tablet, there is no viable target for the tablet, so the operation fails. Using those assumptions, an auxiliary function, `select_viable_rack`, was designed to carefully choose a correct rack, which we'll then pick nodes from to perform the topological operations. It's simple: we just find the first rack in the list that has at least two nodes in it. That should ensure that we perform an operation that doesn't lead to any unforeseen disaster. ------------------------------------------------------------------------ Since the test effectively becomes more complex due to more care for keeping the topology of the cluster valid, we extend the log messages to make them more helpful when debugging a failure.	2025-10-28 14:15:57 +01:00
Emil Maskovsky	87bd328873	group0: remove obsolete "stop_before_becoming_raft_voter" error injection The Raft topology workflow was changed by the limited voters feature: nodes no longer request votership themselves. As a result, the "stop_before_becoming_raft_voter" error injection is now obsolete and has been removed. Fixes: scylladb/scylladb#23418	2025-09-16 18:24:27 +02:00
Emil Maskovsky	0453052d66	test/random_failures: preserve test repeatability when removing error injections The order of entries in the ERROR_INJECTIONS list determines test repeatability for a given random seed. To allow removing error injections without affecting the order of the remaining ones, removed injections are now renamed with a "REMOVED_" prefix instead of being deleted. This ensures they are ignored by the tests, while the sequence of active injections—and thus test reproducibility—remains unchanged.	2025-09-16 18:22:45 +02:00
Evgeniy Naydanov	f6e3fdd778	test.py: rework log_browsing for dtest migration Rework `ScyllaLogFile.wait_for()` method to make it easier to add required methods to ScyllaNode class of ccm-like shim. Also, added `ScyllaLogFile.grep_for_errors()` method and reworked `ScyllaLogFile.grep()`	2025-05-19 11:50:55 +00:00
Dawid Mędrek	c4b32c38a3	test/cluster: Disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the test suite have proven to be more problematic in adjusting to RF-rack-validity. Since we'd like to run as many tests as possible with the `rf_rack_valid_keyspaces` configuration option enabled, let's disable it in those. In the following commit, we'll enable it by default.	2025-05-10 16:30:49 +02:00
Emil Maskovsky	00794af94d	raft/test: disable the `stop_before_becoming_raft_voter` test The workflow of becoming a voter changes with the "limited voters" feature, as the node will no longer become a voter on its own; instead, votership is managed by the topology coordinator. This breaks the `stop_before_becoming_raft_voter` test, as that injection relies on the old behavior. We will disable the test for this particular case for now and address either fixing or complete removal of the test in a follow-up task. Refs: scylladb/scylladb#23418	2025-04-07 12:23:25 +02:00
Evgeniy Naydanov	cac0257914	test.py: random_failures: make it play well with xdist Pass random seed across xdist workers using env variable.	2025-03-30 03:19:30 +00:00
Artsiom Mishuta	d1198f8318	test.py: rename topology_custom folder to cluster rename topology_custom folder to cluster as it contains not only topology test cases	2025-03-04 10:32:44 +01:00
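
The rack selection described in commit 59b2a41c49 ("find the first rack in the list that has at least two nodes in it") can be sketched as follows. This is only an illustration under stated assumptions, not the code from the commit: the `Server` record, its fields, and the function signature are hypothetical, and "first rack" is taken here to mean the rack whose first server appears earliest in the input list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Server:
    # Hypothetical stand-in for the test framework's server record.
    server_id: int
    rack: str


def select_viable_rack(servers: list[Server]) -> Optional[str]:
    """Return the first rack holding at least two servers, or None."""
    by_rack: dict[str, list[Server]] = defaultdict(list)
    for server in servers:
        by_rack[server.rack].append(server)
    # Dicts iterate in insertion order, so racks are visited in the order
    # in which their first server appears in `servers`.
    for rack, members in by_rack.items():
        if len(members) >= 2:
            return rack
    return None


servers = [
    Server(1, "rack1"),
    Server(2, "rack2"),
    Server(3, "rack3"),
    Server(4, "rack2"),  # newest node, appended last
]
chosen = select_viable_rack(servers)
```

With exactly one node per rack the sketch returns `None`, which corresponds to the case where the commit message says the test logs the situation and does nothing.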

9 Commits
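
Commit 0453052d66 keeps removed error injections in `ERROR_INJECTIONS` under a "REMOVED_" prefix so that the remaining entries keep their positions. A minimal sketch of why that preserves repeatability for a given seed is shown below. The selection logic (`pick_injections`) and all injection names except `stop_before_becoming_raft_voter` are assumptions made for illustration; the real test may draw and filter entries differently.

```python
import random

REMOVED_PREFIX = "REMOVED_"

# Hypothetical lists: only stop_before_becoming_raft_voter comes from the log.
ORIGINAL = [
    "injection_a",
    "stop_before_becoming_raft_voter",
    "injection_b",
    "injection_c",
]
RENAMED = [
    "injection_a",
    "REMOVED_stop_before_becoming_raft_voter",
    "injection_b",
    "injection_c",
]


def pick_injections(injections, seed, count):
    """Draw `count` entries by index with a seeded RNG, then drop removed ones."""
    rng = random.Random(seed)
    drawn = [injections[rng.randrange(len(injections))] for _ in range(count)]
    return [name for name in drawn if not name.startswith(REMOVED_PREFIX)]


before = pick_injections(ORIGINAL, seed=42, count=8)
after = pick_injections(RENAMED, seed=42, count=8)

# Because the list length and positions are unchanged, the same seed yields
# the same indices; only the draws that hit the removed entry disappear.
surviving = [name for name in before if name != "stop_before_becoming_raft_voter"]
```

Had the entry been deleted instead, the list would be shorter and every later index would shift, so the same seed could select different injections.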