Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. Debug and sanitize builds are slow enough
that this timeout fires during normal operation, causing flaky node-join
failures. The log analysis below shows how.
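For context, the pattern being replaced looks roughly like this (a
minimal sketch; the exact plumbing differs per test file, and
manager.servers_add is assumed here as the topology framework's
server-startup helper):

    # Hardcoded, mode-independent timeout passed as server config
    # (illustrative; each of the six files does this slightly differently):
    cfg = {'failure_detector_timeout_in_ms': 2000}
    servers = await manager.servers_add(3, config=cfg)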
The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:
20:10:57,049 [shard 0] raft_group0 - server 614 entered
'join group0' transition state for 53b01f0b
The joining node begins receiving the raft snapshot 100ms later:
20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:
20:10:57,511 [shard 0] migration_manager - Creating keyspace
system_auth_v2
...
20:10:57,788 [shard 0] migration_manager - Creating
system_auth_v2.role_members
Meanwhile, the coordinator's failure detector pings the joining node.
Under debug+ASan load the ping RPC times out after ~4.6 seconds, well
past the 2000ms failure-detector window:
20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
when pinging 53b01f0b: seastar::rpc::timeout_error
(rpc call timed out)
25ms later, the coordinator marks the joining node DOWN and removes it:
20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
Mark node 53b01f0b as DOWN
20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
53b01f0b
The joining node was still retrying the snapshot transfer at that point:
20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539
It then receives the ban notification and aborts:
20:11:01,844 [shard 0] raft_group0 - received notification of being
banned from the cluster
Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).
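A minimal sketch of the fixture side (the names failure_detector_timeout
and MODES_TIMEOUT_FACTOR come from conftest.py per this change; the body
below is illustrative, and build_mode is a hypothetical stand-in for
however conftest.py learns the current build mode):

    import pytest

    # Factors matching the text above: 3x debug/sanitize, 2x dev, 1x release.
    MODES_TIMEOUT_FACTOR = {'debug': 3, 'sanitize': 3, 'dev': 2, 'release': 1}

    @pytest.fixture
    def failure_detector_timeout(build_mode):
        # build_mode: hypothetical fixture reporting 'debug', 'sanitize',
        # 'dev', or 'release'.
        return 2000 * MODES_TIMEOUT_FACTOR.get(build_mode, 1)

Tests then pass the fixture's value instead of the literal 2000, e.g.
cfg = {'failure_detector_timeout_in_ms': failure_detector_timeout}.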
Test measurements (before -> after fix):
debug mode:
test_replace_with_same_ip_twice 24.02s -> 25.02s
test_banned_node_notification 217.22s -> 221.72s
test_kill_coordinator_during_op 116.11s -> 127.13s
test_node_failure_during_tablet_migration [streaming-source] 183.25s -> 192.69s
test_replace (4 tests) skipped in debug (skip_in_debug)
test_raft_replace_ignore_nodes skipped in debug (run_in_dev only)
dev mode:
test_replace_different_ip 10.51s -> 11.50s
test_replace_different_ip_using_host_id 10.01s -> 12.01s
test_replace_reuse_ip 10.51s -> 12.03s
test_replace_reuse_ip_using_host_id 13.01s -> 12.01s
test_raft_replace_ignore_nodes 19.52s -> 19.52s
Scylla in-source tests.
For details on how to run the tests, see docs/dev/testing.md
Shared C++ utils and libraries live in lib/; their Python counterparts are in pylib/.
alternator - Python tests which connect to a single server and use the DynamoDB API
unit, boost, raft - unit tests in C++
cqlpy - Python tests which connect to a single server and use CQL
topology* - tests that set up clusters and add/remove nodes
cql - approval tests that use CQL and pre-recorded output
rest_api - tests for the Scylla REST API
scylla-gdb - tests for the scylla-gdb.py helper script
nodetool - tests for the C++ implementation of nodetool
If an existing folder fits, consider adding your test to it. Reserve new folders for large new categories/subsystems, or for cases where the test environment differs significantly from every existing suite, e.g. you plan to start scylladb with a different configuration and intend to add many tests that should share a Scylla cluster (clusters can be reused between tests within the same folder).
To add a new folder, create the directory, then copy a suite.ini from an
existing suite into it and edit it as needed; a sketch follows.
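For illustration only, a hypothetical suite.ini skeleton (the section and
key names below are not authoritative; take the real ones from the
suite.ini of the existing suite you copy):

    ; test/my_suite/suite.ini -- hypothetical skeleton; real keys should be
    ; taken from an existing suite's suite.ini, not from this sketch.
    [suite]
    type = Topology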