# Raft consensus algorithm implementation for Seastar

Seastar is a high performance server-side application framework
written in C++. Please read more about Seastar at http://seastar.io/

This library provides an efficient, extensible implementation of the
Raft consensus algorithm for Seastar.
For more details about Raft see https://raft.github.io/

## Implementation status

- log replication, including throttling for unresponsive servers
- leader election
- configuration changes using joint consensus

## Usage

In order to use the library, the application has to provide
implementations of the RPC, persistence and state machine APIs defined
in raft/raft.hh. The purpose of these interfaces is to:
- provide a way to communicate between Raft protocol instances,
- persist the required protocol state on disk,
  a pre-requisite of the protocol correctness,
- apply committed entries to the state machine.
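
For illustration only, here is a rough sketch of the shape of these
implementations. The actual abstract classes, names and signatures are
defined in raft/raft.hh; everything below (types, methods, members) is
an assumption made for the example:

```cpp
#include <seastar/core/future.hh>
#include <vector>

// Illustrative placeholder types; the real ones live in raft/raft.hh.
using server_id = unsigned;
struct message {};
struct log_entry {};
struct command {};

// Sketch of the RPC interface: communication between protocol instances.
struct example_rpc {
    // The network may lose, reorder or duplicate messages, but must
    // never corrupt them or deliver them to the wrong server.
    virtual seastar::future<> send(server_id to, message msg) = 0;
    virtual ~example_rpc() = default;
};

// Sketch of the persistence interface: durable protocol state.
struct example_persistence {
    // Stored state must survive restarts without corruption.
    virtual seastar::future<> store_log_entries(std::vector<log_entry> entries) = 0;
    virtual ~example_persistence() = default;
};

// Sketch of the state machine interface: applies committed entries.
struct example_state_machine {
    // Called in log order; may be called more than once per entry.
    virtual seastar::future<> apply_entry(command cmd) = 0;
    virtual ~example_state_machine() = default;
};
```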

While the comments for these classes give an insight into the
guarantees they are expected to provide, writing a complying
implementation requires understanding the expectations the Raft
consensus protocol places on its environment:
- RPC should implement a model of an asynchronous, unreliable network,
  in which messages can be lost, reordered, or retransmitted more than
  once, but not corrupted. Specifically, it's an error to
  deliver a message to a Raft server to which it was not sent.
- persistence should provide durable persistent storage, which
  survives between state machine restarts and does not corrupt
  its state.
- the Raft library calls `state_machine::apply_entry()` for entries
  reliably committed to the replication log on a majority of
  servers. While `apply_entry()` is called in the order
  the entries are serialized in the distributed log, there is
  no guarantee that `apply_entry()` is called exactly once.
  E.g. when a protocol instance restarts from the persistent state,
  it may re-apply some already applied log entries.
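
Because of the last point, a complying state machine must tolerate
duplicate applications. A minimal, self-contained sketch of one common
approach (hypothetical types, not the library's API): persist the index
of the last applied entry together with the state and skip anything at
or below it.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of a duplicate-tolerant state machine: `last_applied` is
// persisted atomically together with the data, so re-applying old
// entries after a restart becomes a no-op.
struct kv_state_machine {
    std::map<std::string, std::string> data;
    uint64_t last_applied = 0;

    void apply_entry(uint64_t index, const std::string& key,
                     const std::string& value) {
        if (index <= last_applied) {
            return; // already applied before the restart; skip
        }
        data[key] = value;
        last_applied = index;
    }
};
```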

Seastar's execution model is that every object is safe to use
within a given shard (physical OS thread). The Raft library follows
the same pattern. Calls to the Raft API are safe when they are local
to a single shard. Moving instances of the library between shards
is not supported.

### First usage

For an example of first usage see `replication_test.cc` in test/raft/.

In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of Raft server - the facade to the Raft
  cluster on this node
- repeat the above for every node in the cluster
- use `server::add_entry()` to submit new entries on a leader;
  `state_machine::apply_entries()` is called after the added entry is
  committed by the cluster.
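
In code, the flow could look roughly like the sketch below. The factory
name, its argument list, `wait_type` and the `example_*` implementations
are assumptions for illustration; see `replication_test.cc` for the
real wiring.

```cpp
#include <memory>
#include <seastar/core/coroutine.hh>

// Hypothetical wiring inside a Seastar coroutine. The example_*
// classes stand for application-provided implementations.
seastar::future<> run_node(raft::server_id id) {
    auto rpc = std::make_unique<example_rpc_impl>();
    auto persistence = std::make_unique<example_persistence_impl>();
    auto sm = std::make_unique<example_state_machine_impl>();

    // The facade to the Raft cluster on this node. (The real factory
    // also expects more, e.g. a failure detector; see "Multi-Raft".)
    auto server = raft::create_server(id, std::move(rpc), std::move(sm),
                                      std::move(persistence));
    co_await server->start();

    // On the leader: submit an entry. state_machine::apply_entries()
    // is called once the cluster commits the entry.
    co_await server->add_entry(make_example_command(),
                               raft::wait_type::applied);
}
```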

### Subsequent usages

Similar to the first usage, but `persistence::load_term_and_vote()`,
`persistence::load_log()` and `persistence::load_snapshot()` are
expected to return valid protocol state as persisted by the previous
incarnation of an instance of class server.
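
For instance, a persistence implementation might answer the load calls
from its durable store, as in this sketch (the exact signature,
`term_t`, `_stored_term` and `_voted_for` are illustrative
assumptions):

```cpp
#include <seastar/core/coroutine.hh>
#include <utility>

// Sketch: on a restart, persistence hands back the state written by
// the previous incarnation instead of starting from scratch.
seastar::future<std::pair<term_t, server_id>>
example_persistence::load_term_and_vote() {
    // Return what store_term_and_vote() durably wrote in the previous
    // run; the on-disk format is entirely up to the application.
    co_return std::make_pair(_stored_term, _voted_for);
}
```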

## Architecture bits

### Joint consensus based configuration changes

The Seastar Raft implementation provides arbitrary configuration
changes: it is possible to add and remove one or multiple
nodes in a single transition, or even move a Raft group to an
entirely different set of servers. The implementation adopts
the two-step algorithm described in the original Raft paper:
- first, an entry with the joint configuration is committed to the
  Raft log. The joint configuration contains both the old and
  new sets of servers. Once a server learns about a new
  configuration, it immediately adopts it.
- once a majority of servers persists the joint
  entry, a final entry with the new configuration is appended
  to the log.
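
A joint configuration can be pictured as carrying two member sets at
once: while both are present, an entry is committed only when it is
acknowledged by a majority of each set. An illustrative sketch of such
a representation (not the library's actual type):

```cpp
#include <unordered_set>

using server_id = unsigned; // placeholder for the real id type

// Sketch of a joint configuration: both member sets are present.
// While `previous` is non-empty, an entry commits only when it is
// persisted by a majority of `current` AND a majority of `previous`.
struct configuration {
    std::unordered_set<server_id> current;  // the new member set
    std::unordered_set<server_id> previous; // the old set; empty once
                                            // the final entry commits
    bool is_joint() const { return !previous.empty(); }
};
```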

If a leader is deposed during a configuration change, the new
leader carries out the transition from joint to final configuration
on behalf of the previous leader.

No two configuration changes can happen concurrently: the leader
refuses a new change if the previous one is still in progress.

### Multi-Raft

One of the design goals of Seastar Raft was to support multiple Raft
protocol instances. The library takes the following steps to address
this:
- `class server_address`, used to identify a Raft server instance (one
  participant of a Raft cluster), uses globally unique identifiers,
  while providing an extra `server_info` field which can store a
  network address or connection credentials.
  This makes it possible to share the same transport (RPC) layer
  among multiple instances of Raft. But it is then the responsibility
  of this shared RPC layer to correctly route messages received from
  a shared network channel to the correct Raft server using the
  server UUID, as sketched after this list.
- Raft group failure detection, instead of sending a Raft RPC to each
  follower every 0.1 seconds, relies on external input. It is assumed
  that a single physical server may be a container of multiple Raft
  groups, hence the failure detection RPC can run once at the network
  peer level, not separately for each Raft instance. The library
  expects an accurate `failure_detector` instance from a complying
  implementation.
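
A sketch of both points: a shared transport dispatching incoming
messages by server UUID, and the external failure detection input. All
types and names below are illustrative; the real `failure_detector`
interface is part of the library API in raft/raft.hh.

```cpp
#include <unordered_map>

// Placeholders for the real types.
struct raft_server { void receive(/* message */) {} };
using server_uuid = unsigned long;

// Sketch: one transport, many Raft instances. Incoming messages carry
// the destination server UUID and are dispatched through this map.
struct shared_rpc_layer {
    std::unordered_map<server_uuid, raft_server*> instances;

    void dispatch(server_uuid dst /*, message */) {
        if (auto it = instances.find(dst); it != instances.end()) {
            it->second->receive(/* message */);
        } // else: unknown destination, drop the message
    }
};

// Sketch of the external failure detection input: liveness is tracked
// once per network peer, not separately per Raft group.
struct example_failure_detector {
    // Should return whether the given peer is believed to be alive.
    virtual bool is_alive(server_uuid peer) = 0;
    virtual ~example_failure_detector() = default;
};
```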

### RPC module address mappings

A Raft instance needs to update the RPC subsystem on configuration
changes, so that RPC can deliver messages to the new nodes in the
configuration, as well as dispose of the old nodes, i.e. the nodes
which are no longer part of the most recent configuration.

New nodes are added to the RPC configuration after the
configuration change is committed but before the instance
sends messages to the peers.

Until the messages are successfully delivered to at least
a majority of the "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed of.

There is also another problem to be solved: in Raft, an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority, and then the configuration is changed and a leader not
present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
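
A minimal, self-contained sketch of such an address map, with permanent
entries for configured peers and TTL-bound entries for unknown ones
(the types and the one-minute TTL are assumptions for the example):

```cpp
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

using server_uuid = unsigned long; // placeholder for the real id type

// Sketch: configured peers get permanent entries; unknown peers that
// contacted us get expirable entries, so a reply can still be routed.
class address_map {
    using clock = std::chrono::steady_clock;
    struct entry {
        std::string address;
        std::optional<clock::time_point> expires; // nullopt: permanent
    };
    std::unordered_map<server_uuid, entry> _map;
    static constexpr auto ttl = std::chrono::minutes(1); // assumed TTL

public:
    // Called when the configuration changes.
    void set_permanent(server_uuid id, std::string addr) {
        _map[id] = entry{std::move(addr), std::nullopt};
    }
    // Called when a message arrives from an unknown peer.
    void set_expirable(server_uuid id, std::string addr) {
        auto it = _map.find(id);
        if (it != _map.end() && !it->second.expires) {
            return; // already known permanently; keep that entry
        }
        _map[id] = entry{std::move(addr), clock::now() + ttl};
    }
    // Peers absent from the map cannot be addressed at all.
    std::optional<std::string> find(server_uuid id) {
        auto it = _map.find(id);
        if (it == _map.end()) {
            return std::nullopt;
        }
        if (it->second.expires && clock::now() > *it->second.expires) {
            _map.erase(it); // the TTL has passed; drop the mapping
            return std::nullopt;
        }
        return it->second.address;
    }
};
```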

Outgoing communication to an unconfigured peer, however, remains
impossible.