scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 02:20:37 +00:00

Author	SHA1	Message	Date
Kamil Braun	e086521c1a	direct_failure_detector: get rid of complex `endpoint_id` translations The direct failure detector operates on abstract `endpoint_id`s for pinging. The `pigner` interface is responsible for translating these IDs to 'real' addresses. Earlier we used two types of addresses: IP addresses in 'production' code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test code (in `randomized_nemesis_test`). For each of these use cases we would maintain mappings between `endpoint_id`s and the address type. In recent commits we switched the 'production' code to also operate on Raft server IDs, which are UUIDs underneath. In this commit we switch `endpoint_id`s from `unsigned` type to `utils::UUID`. Because each use case operates in Raft server IDs, we can perform a simple translation: `raft_id.uuid()` to get an `endpoint_id` from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from an `endpoint_id`. We no longer have to maintain complex sharded data structures to store the mappings.	2022-11-04 09:38:08 +01:00
Gleb Natapov' via ScyllaDB development	0d20830863	direct_failure_detector: reduce severity of ping error logging Having an error while pinging a peer is not a critical error. The code retires and move on. Lets log the message with less severity since sometimes those error may happen (for instance during node replace operation some nodes refuse to answer to pings) and dtest complains that there are unexpected errors in the logs. Message-Id: <Ywy5e+8XVwt492Nc@scylladb.com>	2022-08-31 07:11:59 +03:00
Avi Kivity	4b53af0bd5	treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime of the function object is less ambiguous, and so it is safer. Replace all eligible occurences (i.e. caller is a coroutine). One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra attention since there was a handle_exception() continuation attached. It is converted to a try/catch. Closes #10699	2022-05-31 09:06:24 +03:00
Kamil Braun	666e5a414d	direct_failure_detector: introduce new failure detector service The new service performs failure detection by periodically pinging endpoints. The set of pinged endpoints can be dynamically extended and shrinked. To learn about liveness of endpoints, user of the service registers a listener and chooses a threshold - a duration of time which has to pass since the last successful ping in order to mark an endpoint as dead. When an endpoint responds it's immediately marked as alive. Endpoints are identified using abstract integer identifiers. The method of performing a ping is a dependency of the service provided by the user through the `pinger` interface. The implementation of `pinger` is responsible for translating the abstract endpoint IDs to 'real' addresses. For example, production implementation may map endpoint IDs to IP addresses and use TCP/IP to perform the ping, while a test/simulation implementation may use a simulated network that also operates on abstract identifiers. Similarly, the method of measuring time is a dependency provided by the user using the `clock` interface. The service operates on abstract time intervals and timepoints. So, for example, in a production implementation time can be measured using a stopwatch, while in test/simulation we can use a logical clock. The service distributes work across different shards. When an endpoint is added to the set of detected endpoints, the service will choose a shard with the smallest amount of workers and create a worker that is responsible for periodically pinging this endpoint on that shard and sending notifications to listeners. Endpoints can be added or removed only through the shard 0 instance of the service and shard 0 is responsible for coordinating the endpoint workers. Listeners can be registered on any shard.	2022-05-09 13:14:40 +02:00

4 Commits