Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, this causes
flaky node join failures. The following log analysis shows how.
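For context, the override being removed looks roughly like this at the affected
call sites (an illustrative sketch; the exact tests differ, and the `config=`
parameter of `ManagerClient.servers_add()` is an assumption here):

```python
# Sketch: every build mode gets the same aggressive 2s failure-detector timeout.
config = {'failure_detector_timeout_in_ms': 2000}
servers = await manager.servers_add(3, config=config)
```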
The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:
20:10:57,049 [shard 0] raft_group0 - server 614 entered
'join group0' transition state for 53b01f0b
The joining node begins receiving the raft snapshot 100ms later:
20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:
20:10:57,511 [shard 0] migration_manager - Creating keyspace
system_auth_v2
...
20:10:57,788 [shard 0] migration_manager - Creating
system_auth_v2.role_members
Meanwhile, the coordinator's failure detector keeps pinging the joining
node. Under debug+ASan load, the ping RPC times out after ~4.6 seconds:
20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
when pinging 53b01f0b: seastar::rpc::timeout_error
(rpc call timed out)
25ms later, the coordinator marks the joining node DOWN and removes it:
20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
Mark node 53b01f0b as DOWN
20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
53b01f0b
The joining node was still retrying the snapshot transfer at that point:
20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then receives the ban notification and aborts:
20:11:01,844 [shard 0] raft_group0 - received notification of being
banned from the cluster
Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).
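A minimal sketch of how such a fixture can be defined and consumed (the fixture
name and scaling factors come from the description above; the `build_mode`
fixture, the dictionary layout, and the test body are assumptions for
illustration):

```python
import pytest

# conftest.py (sketch): scale the 2000ms base by the per-mode factor.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}

@pytest.fixture
def failure_detector_timeout(build_mode) -> int:
    # build_mode is assumed to be an existing fixture exposing the build mode name.
    return 2000 * MODES_TIMEOUT_FACTOR.get(build_mode, 1)

# In a test (sketch): pass the scaled value instead of the hardcoded 2000.
async def test_node_join(manager, failure_detector_timeout):
    config = {'failure_detector_timeout_in_ms': failure_detector_timeout}
    await manager.servers_add(3, config=config)
```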
Test measurements (before -> after fix):
debug mode:
test_replace_with_same_ip_twice 24.02s -> 25.02s
test_banned_node_notification 217.22s -> 221.72s
test_kill_coordinator_during_op 116.11s -> 127.13s
test_node_failure_during_tablet_migration[streaming-source] 183.25s -> 192.69s
test_replace (4 tests) skipped in debug (skip_in_debug)
test_raft_replace_ignore_nodes skipped in debug (run_in_dev only)
dev mode:
test_replace_different_ip 10.51s -> 11.50s
test_replace_different_ip_using_host_id 10.01s -> 12.01s
test_replace_reuse_ip 10.51s -> 12.03s
test_replace_reuse_ip_using_host_id 13.01s -> 12.01s
test_raft_replace_ignore_nodes 19.52s -> 19.52s
Rework the `ScyllaLogFile.wait_for()` method to make it easier
to add the required methods to the `ScyllaNode` class of the ccm-like shim.
Also add a `ScyllaLogFile.grep_for_errors()` method and rework
`ScyllaLogFile.grep()`.
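For illustration, a standalone sketch of what error-grepping over a node log
can look like (the actual `ScyllaLogFile.grep_for_errors()` signature and error
patterns may differ):

```python
import re
from pathlib import Path

ERROR_RE = re.compile(r"(ERROR|Segmentation fault|Aborting on shard)")

def grep_for_errors(log_path: Path, ignored: tuple[re.Pattern, ...] = ()) -> list[str]:
    """Return log lines that look like errors, minus explicitly ignored patterns."""
    return [line
            for line in log_path.read_text(errors="replace").splitlines()
            if ERROR_RE.search(line) and not any(p.search(line) for p in ignored)]
```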
We adjust all of the simple cluster test cases so they work
with `rf_rack_valid_keyspaces: true`. This boils down to assigning
the nodes to multiple racks. For most of the changes, we do that by:
* Using `pytest.mark.prepare_3_racks_cluster` instead of
`pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
`ManagerClient::servers_add()`.
In some cases, we need to assign the racks manually, which may be
less obvious, but none of those tests relied on the specific rack
assignment, so the change doesn't affect them or what they verify.
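For illustration, the two common adjustments look roughly like this (the marker
and the `auto_rack_dc` argument are the ones named above; the test bodies and
the `"dc1"` value are hypothetical):

```python
import pytest

# Before: three nodes, all placed in a single rack.
@pytest.mark.prepare_3_nodes_cluster
async def test_feature_single_rack(manager):
    ...

# After: three racks with one node each, so RF=3 keyspaces remain rack-valid.
@pytest.mark.prepare_3_racks_cluster
async def test_feature_three_racks(manager):
    ...

# Tests that add servers themselves spread them across racks instead:
async def test_manual_add(manager):
    servers = await manager.servers_add(3, auto_rack_dc="dc1")
```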