scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 18:50:53 +00:00

Kamil Braun 666e5a414d direct_failure_detector: introduce new failure detector service

The new service performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrunk. To learn about the liveness of endpoints, a user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds, it is immediately marked as alive.
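
The threshold rule above can be sketched as a small state machine.
This is only an illustration, not the code in failure_detector.cc:
the names `liveness_tracker`, `change`, `on_ping_success` and `on_tick`
are hypothetical, time is a plain integer, and the sketch assumes an
endpoint starts out as not alive and that reaching the threshold
exactly already counts as dead.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the service's abstract time types.
using timepoint_t = std::int64_t;
using interval_t = std::int64_t;

// What a listener would be told after an event, if anything.
enum class change { none, became_alive, became_dead };

class liveness_tracker {
    interval_t _threshold;
    timepoint_t _last_success = 0;  // meaningful only while _alive is true
    bool _alive = false;            // assumption: unknown endpoints start as dead
public:
    explicit liveness_tracker(interval_t threshold) : _threshold(threshold) {}

    // A successful ping marks the endpoint alive immediately.
    change on_ping_success(timepoint_t now) {
        _last_success = now;
        if (!_alive) {
            _alive = true;
            return change::became_alive;
        }
        return change::none;
    }

    // Called periodically: the endpoint becomes dead once at least
    // `threshold` has passed since the last successful ping.
    change on_tick(timepoint_t now) {
        if (_alive && now - _last_success >= _threshold) {
            _alive = false;
            return change::became_dead;
        }
        return change::none;
    }

    bool alive() const { return _alive; }
};
```

With a threshold of 10, a success at time 100 followed by ticks at 105
and 110 reports nothing at 105 and `became_dead` at 110; the next
successful ping reports `became_alive` again.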

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger`
is responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, a production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.
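
To illustrate the dependency described above, here is a minimal
sketch of a `pinger`-style interface with a simulated implementation.
The synchronous `bool ping(endpoint_id)` signature, the `endpoint_id`
alias and the `simulated_pinger` class are assumptions made for this
example; the actual interface in failure_detector.hh may differ (for
instance, it may be asynchronous).

```cpp
#include <cstdint>
#include <unordered_set>

// Abstract integer identifier of an endpoint (hypothetical alias).
using endpoint_id = std::uint64_t;

// Hypothetical, simplified pinger interface: returns true if the
// endpoint answered the ping.
struct pinger {
    virtual ~pinger() = default;
    virtual bool ping(endpoint_id ep) = 0;
};

// Test/simulation implementation: no real network, just a set of
// endpoint IDs that are currently considered reachable.
class simulated_pinger : public pinger {
    std::unordered_set<endpoint_id> _reachable;
public:
    void set_reachable(endpoint_id ep, bool reachable) {
        if (reachable) {
            _reachable.insert(ep);
        } else {
            _reachable.erase(ep);
        }
    }

    bool ping(endpoint_id ep) override {
        return _reachable.count(ep) > 0;
    }
};
```

A production implementation would keep the same interface but
translate each `endpoint_id` to a real address before sending the
ping.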

Similarly, the method of measuring time is a dependency provided by the
user through the `clock` interface. The service operates on abstract
time intervals and timepoints. So, for example, in a production
implementation, time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.
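
A logical clock for tests can be sketched as follows. The
interface shown here is hypothetical and reduced to a single `now()`
method; `logical_clock`, `advance` and `threshold_passed` are names
invented for this example, and integer ticks stand in for the
abstract timepoints and intervals.

```cpp
#include <cstdint>

// The namespace avoids a clash with ::clock() from <ctime>.
namespace fd_sketch {

using timepoint_t = std::int64_t;  // abstract timepoint, in ticks
using interval_t = std::int64_t;   // abstract duration, in ticks

// Hypothetical, simplified clock interface.
struct clock {
    virtual ~clock() = default;
    virtual timepoint_t now() const = 0;
};

// Logical clock: time moves only when the test advances it.
class logical_clock : public clock {
    timepoint_t _now = 0;
public:
    timepoint_t now() const override { return _now; }
    void advance(interval_t ticks) { _now += ticks; }
};

// True once at least `threshold` ticks have passed since `last_success`.
inline bool threshold_passed(const clock& c, timepoint_t last_success,
                             interval_t threshold) {
    return c.now() - last_success >= threshold;
}

} // namespace fd_sketch
```

Because the test controls `advance`, timing-dependent behaviour
becomes deterministic and does not depend on wall-clock time.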

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose the
shard with the fewest workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.
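
The shard selection described above amounts to picking the minimum
of the per-shard worker counts. The sketch below is an illustration
only: `pick_shard` and `add_endpoint` are hypothetical helpers, the
counts live in a plain vector rather than in a sharded service, and
ties are broken in favour of the lowest shard index, which is an
assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the shard with the fewest workers.
// Precondition: workers_per_shard is not empty.
// std::min_element returns the first minimum, so ties go to the
// lowest shard index.
inline std::size_t pick_shard(const std::vector<std::size_t>& workers_per_shard) {
    auto it = std::min_element(workers_per_shard.begin(), workers_per_shard.end());
    return static_cast<std::size_t>(it - workers_per_shard.begin());
}

// Records a new worker on the least loaded shard and returns that shard.
inline std::size_t add_endpoint(std::vector<std::size_t>& workers_per_shard) {
    std::size_t shard = pick_shard(workers_per_shard);
    ++workers_per_shard[shard];
    return shard;
}
```

Starting from counts {2, 0, 1}, four consecutive additions land on
shards 1, 1, 2 and 0, leaving every shard with three workers or
fewer and the load evenly spread.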

Endpoints can be added or removed only through the shard 0 instance of
the service and shard 0 is responsible for coordinating the endpoint
workers. Listeners can be registered on any shard.

2022-05-09 13:14:40 +02:00

failure_detector.cc

direct_failure_detector: introduce new failure detector service

2022-05-09 13:14:40 +02:00

failure_detector.hh

direct_failure_detector: introduce new failure detector service

2022-05-09 13:14:40 +02:00