Commit Graph

3 Commits

Author SHA1 Message Date
Tomasz Grabiec
067f94d3f8 direct_failure_detector: Avoid throwing exceptions in the success path
sleep_abortable() is aborted on success, which causes sleep_aborted
exception to be thrown. This causes scylla to throw every 100ms for
each pinged node. Throwing may reduce performance if happens often.

Also, it spams the logs if --logger-log-level exception=trace is enabled.

Avoid by swallowing the exception on cancellation.

Fixes #13278.

Closes #13279

(cherry picked from commit 99cb948eac)
2023-04-27 20:49:35 +03:00
Avi Kivity
4b53af0bd5 treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurences (i.e. caller is a coroutine).

One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.

Closes #10699
2022-05-31 09:06:24 +03:00
Kamil Braun
666e5a414d direct_failure_detector: introduce new failure detector service
The new service performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger`
is responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

Endpoints can be added or removed only through the shard 0 instance of
the service and shard 0 is responsible for coordinating the endpoint
workers. Listeners can be registered on any shard.
2022-05-09 13:14:40 +02:00