scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Botond Dénes	70261dc674	Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz The failure_detector_timeout_in_ms override of 2000ms in 6 cluster test files is too aggressive for debug/sanitize builds. During node joins, the coordinator's failure detector times out on RPC pings to the joining node while it is still applying schema snapshots, marks it DOWN, and bans it — causing flaky test failures. Scale the timeout by MODES_TIMEOUT_FACTOR (3x for debug/sanitize, 2x for dev, 1x for release) via a shared failure_detector_timeout fixture in conftest.py. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1587 Backport: no, elasticsearch analyser shows only a single failure Closes scylladb/scylladb#29522 * github.com:scylladb/scylladb: test/cluster: scale failure_detector_timeout_in_ms by build mode test/cluster: add failure_detector_timeout fixture	2026-04-24 09:10:43 +03:00
Łukasz Paszkowski	d18eb9479f	cql/statement: Create keyspace_metadata with correct initial_tablets count In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count is set to 0 only when tablets are enabled and the replication strategy is NetworkTopologyStrategy. This effectively sets _uses_tablets = false in abstract_replication_strategy for the remaining strategies when no `tablets = {...}` options are specified. As a consequence, it is possible to create vnode-based keyspaces even when tablets are enforced with `tablets_mode_for_new_keyspaces`. The patch sets the default initial tablets count to zero regardless of the chosen replication strategy. Each replication strategy then validates the options and raises a configuration exception when tablets are not supported. All tests are altered in the following way: + wherever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy + otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}` Fixes https://github.com/scylladb/scylladb/issues/25340 Closes scylladb/scylladb#25342	2026-04-20 17:57:38 +03:00
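The two test adjustments described in the commit above can be illustrated with a small statement builder. This is a minimal sketch, not code from the patch: the helper name `create_keyspace_cql` and its parameters are hypothetical, and only the strategy names and the `AND tablets = {'enabled': false}` clause come from the commit message.

```python
from typing import Optional


def create_keyspace_cql(
    name: str,
    strategy: str = "NetworkTopologyStrategy",
    replication_factor: int = 1,
    tablets_enabled: Optional[bool] = None,
) -> str:
    """Build a CREATE KEYSPACE statement (hypothetical test helper).

    tablets_enabled=None leaves the tablets option out, so the server-side
    default applies; True/False adds an explicit `tablets` clause.
    """
    stmt = (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': '{strategy}', 'replication_factor': {replication_factor}}}"
    )
    if tablets_enabled is not None:
        # CQL booleans are lowercase: true / false.
        stmt += f" AND tablets = {{'enabled': {str(tablets_enabled).lower()}}}"
    return stmt


# Option 1: switch the test to NetworkTopologyStrategy.
nts_stmt = create_keyspace_cql("ks1")

# Option 2: keep SimpleStrategy but disable tablets explicitly.
simple_stmt = create_keyspace_cql("ks2", strategy="SimpleStrategy", tablets_enabled=False)
```

A test that genuinely needs SimpleStrategy would use the second form, so the keyspace is vnode-based by explicit request rather than through an unvalidated default.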
Marcin Maliszkiewicz	e414b2b0b9	test/cluster: scale failure_detector_timeout_in_ms by build mode

Six cluster test files override failure_detector_timeout_in_ms to 2000ms for faster failure detection. In debug and sanitize builds, this causes flaky node join failures. The following log analysis shows how.

The coordinator (server 614, IP 127.2.115.3) accepts the joining node (server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:

    20:10:57,049 [shard 0] raft_group0 - server 614 entered 'join group0' transition state for 53b01f0b

The joining node begins receiving the raft snapshot 100ms later:

    20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then spends ~280ms applying schema changes -- creating 6 keyspaces and 12+ tables from the snapshot:

    20:10:57,511 [shard 0] migration_manager - Creating keyspace system_auth_v2
    ...
    20:10:57,788 [shard 0] migration_manager - Creating system_auth_v2.role_members

Meanwhile, the coordinator's failure detector pings the joining node. Under debug+ASan load the RPC call times out after ~4.6 seconds:

    20:11:01,643 [shard 0] direct_failure_detector - unexpected exception when pinging 53b01f0b: seastar::rpc::timeout_error (rpc call timed out)

25ms later, the coordinator marks the joining node DOWN and removes it:

    20:11:01,668 [shard 0] raft_group0 - failure_detector_loop: Mark node 53b01f0b as DOWN
    20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept 53b01f0b

The joining node was still retrying the snapshot transfer at that point:

    20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then receives the ban notification and aborts:

    20:11:01,844 [shard 0] raft_group0 - received notification of being banned from the cluster

Replace the hardcoded 2000ms with the failure_detector_timeout fixture from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).

Test measurements (before -> after fix):

debug mode:
    test_replace_with_same_ip_twice 24.02s -> 25.02s
    test_banned_node_notification 217.22s -> 221.72s
    test_kill_coordinator_during_op 116.11s -> 127.13s
    test_node_failure_during_tablet_migration [streaming-source] 183.25s -> 192.69s
    test_replace (4 tests) skipped in debug (skip_in_debug)
    test_raft_replace_ignore_nodes skipped in debug (run_in_dev only)

dev mode:
    test_replace_different_ip 10.51s -> 11.50s
    test_replace_different_ip_using_host_id 10.01s -> 12.01s
    test_replace_reuse_ip 10.51s -> 12.03s
    test_replace_reuse_ip_using_host_id 13.01s -> 12.01s
    test_raft_replace_ignore_nodes 19.52s -> 19.52s	2026-04-20 15:28:34 +02:00
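The scaling rule in the commit above (2000ms base, multiplied by 3x, 2x or 1x depending on build mode) can be sketched as a plain function. This is an illustration under assumptions, not the actual conftest.py code: the commit names `MODES_TIMEOUT_FACTOR` and a `failure_detector_timeout` fixture, but the dictionary layout, the constant `BASE_FAILURE_DETECTOR_TIMEOUT_MS`, the function name `scaled_failure_detector_timeout`, and the fallback factor for unknown modes are all hypothetical.

```python
# Base override used by the cluster tests, per the commit message.
BASE_FAILURE_DETECTOR_TIMEOUT_MS = 2000  # hypothetical constant name

# Assumed shape: build mode -> multiplier. The factors are the ones quoted
# in the commit message; the real structure in the repository may differ.
MODES_TIMEOUT_FACTOR = {
    "debug": 3,
    "sanitize": 3,
    "dev": 2,
    "release": 1,
}


def scaled_failure_detector_timeout(build_mode: str) -> int:
    """Return failure_detector_timeout_in_ms scaled for the given build mode.

    Unknown modes fall back to a factor of 1 (an assumption of this sketch).
    """
    factor = MODES_TIMEOUT_FACTOR.get(build_mode, 1)
    return BASE_FAILURE_DETECTOR_TIMEOUT_MS * factor


# Config override a test might pass to a node, keyed by the option name
# from the commit message.
debug_config = {"failure_detector_timeout_in_ms": scaled_failure_detector_timeout("debug")}
```

In a pytest setup, a function like this would typically be wrapped in a fixture that reads the current build mode, so each test file requests the scaled value instead of hardcoding 2000.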
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Gleb Natapov	39cec4ae45	topology: let banned node know that it is banned Currently, if a banned node tries to connect to a cluster, it fails to create connections but has no idea why, so from inside the node it looks like a communication problem. This patch adds a new RPC, NOTIFY_BANNED, which is sent back to the node when its connection is dropped. On receiving the RPC, the node isolates itself and prints an informative message about why it did so. Closes scylladb/scylladb#26943	2025-11-24 17:12:13 +01:00
Artsiom Mishuta	4b975668f6	tiering (test.py): introduce tiering labels Introduce tiering marks: 1. “unstable” - for unstable tests that will continue running every night and generate up-to-date failure statistics without failing the “Main” verification path (scylla-ci, Next). 2. “nightly” - for tests that are quite old and stable, test functionality that is unlikely to be changed or affected by other features, are partially covered by other tests, verify non-critical functionality, have not found any issues or regressions, are too long to run on every PR, and can be dropped from the CI run. Set 7 long tests (according to statistics in elastic) as nightly (these 8 tests took 20% of the CI run, about 4 hours without parallelization) and 1 test as unstable (as an example of marker usage). Closes scylladb/scylladb#24974	2025-08-04 15:38:16 +03:00
Artsiom Mishuta	d1198f8318	test.py: rename topology_custom folder to cluster rename topology_custom folder to cluster as it contains not only topology test cases	2025-03-04 10:32:44 +01:00

7 Commits