The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".
More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.
Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but nevermind.
In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.
This pretty much invalidates the original argument, as seen on
the above example: A will still consider D alive, thus it won't become
a candidate.
Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believeing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.
Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.
In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
Raft consensus algorithm implementation for Seastar
Seastar is a high performance server-side application framework written in C++. Please read more about Seastar at http://seastar.io/
This library provides an efficient, extensible, implementation of Raft consensus algorithm for Seastar. For more details about Raft see https://raft.github.io/
Implementation status
- log replication, including throttling for unresponsive servers
- leader election
- configuration changes using joint consensus
Usage
In order to use the library the application has to provide implementations for RPC, persistence and state machine APIs, defined in raft/raft.hh. The purpose of these interfaces is:
- provide a way to communicate between Raft protocol instances
- persist the required protocol state on disk, a pre-requisite of the protocol correctness,
- apply committed entries to the state machine.
While comments for these classes provide an insight into expected guarantees they should provide, in order to provide a complying implementation it's necessary to understand the expectations of the Raft consistency protocol on its environment:
- RPC should implement a model of asynchronous, unreliable network, in which messages can be lost, reordered, retransmitted more than once, but not corrupted. Specifically, it's an error to deliver a message to a Raft server which was not sent to it.
- persistence should provide a durable persistent storage, which survives between state machine restarts and does not corrupt its state.
- Raft library calls
state_machine::apply_entry()for entries reliably committed to the replication log on the majority of servers. Whileapply_entry()is called in the order entries are serialized in the distributed log, there is no guarantee thatapply_entry()is called exactly once. E.g. when a protocol instance restarts from the persistent state, it may re-apply some already applied log entries.
Seastar's execution model is that every object is safe to use within a given shard (physical OS thread). Raft library follows the same pattern. Calls to Raft API are safe when they are local to a single shard. Moving instances of the library between shards is not supported.
First usage.
For an example of first usage see replication_test.cc in test/raft/.
In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of Raft server - the facade to the Raft cluster on this node
- repeat the above for every node in the cluster
- use
server::add_entry()to submit new entries on a leader,state_machine::apply_entries()is called after the added entry is committed by the cluster.
Subsequent usages
Similar to the first usage, but persistence::load_term_and_vote()
persistence::load_log(), persistence::load_snapshot() are expected to
return valid protocol state as persisted by the previous incarnation
of an instance of class server.
Architecture bits
Joint consensus based configuration changes
Seastar Raft implementation provides arbitrary configuration changes: it is possible to add and remove one or multiple nodes in a single transition, or even move Raft group to an entirely different set of servers. The implementation adopts the two-step algorithm described in the original Raft paper:
- first, an entry in the Raft log with joint configuration is committed. The joint configuration contains both old and new sets of servers. Once a server learns about a new configuration, it immediately adopts it.
- once a majority of servers persists the joint entry, a final entry with new configuration is appended to the log.
If a leader is deposed during a configuration change, the new leader carries out the transition from joint to final configuration for it. it carries out the transition for the prevoius leader.
No two configuration changes could happen concurrently. The leader refuses a new change if the previous one is still in progress.
Multi-Raft
One of the design goals of Seastar Raft was to support multiple Raft protocol instances. The library takes the following steps to address this:
class server_address, used to identify a Raft server instance (one participant of a Raft cluster) uses globally unique identifiers, while provides an extraserver_infofield which can then store a network address or connection credentials. This makes it possible to share the same transport (RPC) layer among multiple instances of Raft. But it is then the responsibility of this shared RPC layer to correctly route messages received from a shared network channel to a correct Raft server using server UUID.- Raft group failure detection, instead of sending Raft RPC every 0.1 second
to each follower, relies on external input. It is assumed
that a single physical server may be a container of multiple Raft
groups, hence failure detection RPC could run once on network peer level,
not sepately for each Raft instance. The library expects an accurate
failure_detectorinstance from a complying implementation.
RPC module address mappings
Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes. I.e. the nodes which are not the part of the most recent configuration anymore.
New nodes are added to the RPC configuration after the configuration change is committed but before the instance sends messages to the peers.
Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed.
There is also another problem to be solved: in Raft an instance may need to communicate with a peer outside its current configuration. This may happen, e.g., when a follower falls out of sync with the majority and then a configuration is changed and a leader not present in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the return address of the peer to the address map with a TTL. Should we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.