scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 12:17:02 +00:00


Kamil Braun 03818c4aa9 direct_failure_detector: increase ping timeout and make it tunable

The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish before the next
ping can start. We time out pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 consecutive timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings, which
makes it more robust in the presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again if we use Raft for the data path, so that leader
elections start quickly (faster than 2s) after leader failures. To do
that we'll have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 consecutive failed pings, which happens with a certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark the server as alive again.

With the change, the phenomenon no longer appears.
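
The flapping described above can be reproduced in a toy model. The
sketch below is not the Scylla implementation: it assumes one ping in
flight at a time, no pause between pings, and a round-trip latency
drawn uniformly from 200ms to 400ms (a stand-in for "~300ms with
jitter"). The function and parameter names are made up for
illustration. How many timed out pings it takes to cross the threshold
depends on details this model leaves out, so only the qualitative
result is meaningful: frequent "dead" marks with the old settings,
none with the new ones.

```python
import random


def simulate(ping_timeout, threshold, pings=10_000, seed=0):
    """Count alive -> dead transitions in a toy sequential pinger.

    Assumptions (not taken from the Scylla source): the next ping starts
    as soon as the previous one finishes or times out, and the round-trip
    latency is uniform in [0.2 s, 0.4 s].
    """
    rng = random.Random(seed)
    now = 0.0          # virtual clock, in seconds
    last_heard = 0.0   # time of the last successful ping
    alive = True
    marked_dead = 0
    for _ in range(pings):
        latency = rng.uniform(0.2, 0.4)
        if latency <= ping_timeout:
            # The ping succeeded: the endpoint is (back) alive.
            now += latency
            last_heard = now
            alive = True
        else:
            # The ping timed out: we waited the full timeout for nothing.
            now += ping_timeout
        # The listener is checked in between pings.
        if alive and now - last_heard > threshold:
            alive = False
            marked_dead += 1
    return marked_dead


old = simulate(ping_timeout=0.3, threshold=1.0)
new = simulate(ping_timeout=0.6, threshold=2.0)
print(f"old settings (300ms / 1s): marked dead {old} times")
print(f"new settings (600ms / 2s): marked dead {new} times")
```

With the new settings no ping in this model can time out, because the
assumed latency never exceeds 400ms, so the count stays at zero.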

Closes scylladb/scylladb#18443

2024-05-07 23:40:23 +02:00

alternator

alternator: do not use tablets on new Alternator tables

2024-04-04 12:11:29 +03:00

auth_cluster

Merge 'auth: don't run legacy migrations in auth-v2 mode' from Marcin Maliszkiewicz

2024-05-06 19:53:35 +02:00

boost

test/lib: do not include unused headers

2024-05-05 23:31:48 +03:00

broadcast_tables

db: config: make consistent_cluster_management mandatory

2023-12-14 16:54:04 +01:00

cql

cql3:statement_restrictions.cc add more conditions to prevent "allow filtering" error to pop up in delete/update statements

2023-12-07 21:25:18 +02:00

cql-pytest

cql3: Fix invalid JSON parsing for JSON object with different key types

2024-05-05 15:42:43 +03:00

lib

direct_failure_detector: increase ping timeout and make it tunable

2024-05-07 23:40:23 +02:00

manual

treewide: include fmt/ranges.h and/or fmt/std.h

2024-04-19 22:56:16 +08:00

nodetool

tools/scylla-nodetool: implement the resetlocalschema command

2024-05-01 08:49:11 +03:00

object_store

test: Add test for how quarantined sstables registry entries are loaded

2024-04-26 16:54:43 +03:00

perf

test/perf: report also log_allocations/op

2024-05-02 18:42:41 +03:00

pylib

test: return file mark from wait_for that points after the found string

2024-04-30 15:06:32 +03:00

pylib_test

test.py: support code coverage

2024-01-18 11:11:34 +02:00

raft

direct_failure_detector: increase ping timeout and make it tunable

2024-05-07 23:40:23 +02:00

redis

Typos: fix typos in comments

2023-12-02 22:37:22 +02:00

resource

rust: update dependencies

2023-12-17 13:20:25 +02:00

rest_api

db: config: make consistent-topology-changes unused

2024-04-25 14:33:21 +02:00

scylla-gdb

Typos: fix typos in comments

2023-12-02 22:37:22 +02:00

topology

storage_service: do not take API lock for removenode operation if topology coordinator is enabled

2024-04-30 15:13:50 +03:00

topology_custom

direct_failure_detector: increase ping timeout and make it tunable

2024-05-07 23:40:23 +02:00

topology_experimental_raft

Fix flakiness in test_tablet_load_and_stream due to premature gossiper abort on shutdown

2024-05-07 02:31:02 +02:00

unit

treewide: do not define FMT_DEPRECATED_OSTREAM

2024-04-19 22:57:36 +08:00

__init__.py

…

CMakeLists.txt

build: cmake: add "unit_test_list" target

2024-01-10 08:43:04 +02:00

README.md

test: provide overview of the contents of test/ directory

2023-11-26 15:51:07 +02:00

README.md

Scylla in-source tests.

For details on how to run the tests, see docs/dev/testing.md

Shared C++ utils, libraries are in lib/, for Python - pylib/

alternator - Python tests which connect to a single server and use the DynamoDB API
unit, boost, raft - unit tests in C++
cql-pytest - Python tests which connect to a single server and use CQL
topology* - tests that set up clusters and add/remove nodes
cql - approval tests that use CQL and pre-recorded output
rest_api - tests for Scylla REST API Port 9000
scylla-gdb - tests for scylla-gdb.py helper script
nodetool - tests for C++ implementation of nodetool

If you can use an existing folder, consider adding your test to it. New folders should be used for new large categories/subsystems, or when the test environment is significantly different from that of any existing suite: for example, you plan to start scylladb with a different configuration, you intend to add many tests, and you would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).

To add a new folder, create a new directory, then copy a suite.ini from an existing folder into it and edit the copy.
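
As a rough sketch of that step, assuming the new folder is seeded from the suite.ini of an existing one: the helper name `add_test_folder` and the folder names in the usage comment are hypothetical, and the contents of suite.ini still have to be edited by hand afterwards.

```python
import shutil
from pathlib import Path


def add_test_folder(test_root, new_name, template_name):
    """Create test_root/new_name and copy template_name's suite.ini into it.

    Hypothetical helper for illustration; it is not part of the Scylla tree.
    Returns the path of the copied suite.ini, which then needs editing.
    """
    test_root = Path(test_root)
    new_dir = test_root / new_name
    new_dir.mkdir()  # raises FileExistsError if the folder already exists
    target = new_dir / "suite.ini"
    shutil.copy(test_root / template_name / "suite.ini", target)
    return target


# Example (folder names are placeholders):
# add_test_folder("test", "my_new_suite", "cql")
```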