This patch implements the Raft extension that allows performing
linearisable reads by accessing the local state machine. The extension
is described in section 6.4 of the Raft PhD dissertation. To sum it up:
to perform a read barrier, a follower needs to ask the leader for the
last committed index that it knows about. The leader must make sure
that it is still the leader before answering, by communicating with a
quorum. When the follower gets the index back, it waits for that index
to be applied, and with that the read_barrier invocation completes.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is used by a follower to
ask the leader for a safe index to read at. The first two are used by
the leader to communicate with a quorum.
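The flow can be sketched roughly as follows; all names here
(`leader_state`, `quorum_alive`, `read_barrier_complete`) are
illustrative, not the actual patch API:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative leader state: the last committed index it knows about,
// and whether it has just confirmed its leadership with a quorum.
struct leader_state {
    uint64_t commit_index;
    bool quorum_alive;
};

// execute_read_barrier_on_leader: the leader only answers with its
// commit index after making sure it still leads a quorum (in the real
// patch this is the read_barrier/read_barrier_reply round trip).
bool execute_read_barrier_on_leader(const leader_state& l,
                                    uint64_t& safe_index) {
    if (!l.quorum_alive) {
        return false; // possibly deposed; the follower has to retry
    }
    safe_index = l.commit_index;
    return true;
}

// The follower's barrier completes once its local state machine has
// applied everything up to the index returned by the leader.
bool read_barrier_complete(uint64_t applied_index, uint64_t safe_index) {
    return applied_index >= safe_index;
}
```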
Add a helper to wait until a Raft cluster leader is elected.
It can be used to avoid sleeps when it's necessary to forward
a request to the leader, but the leader is not yet known.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
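A minimal sketch of the lookup, assuming configuration entries are kept
in an index-ordered map and interpreting "before" as strictly before
the given index (`config` here is a stand-in for the real configuration
type):

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using index_t = uint64_t;
using config = std::string; // stand-in for raft::configuration

// Returns the configuration of the last configuration entry strictly
// before `idx`; if there is none in the log, fall back to the
// configuration stored in the snapshot.
config log_last_conf_before(const std::map<index_t, config>& conf_entries,
                            const config& snapshot_cfg, index_t idx) {
    auto it = conf_entries.lower_bound(idx); // first entry at or after idx
    if (it == conf_entries.begin()) {
        return snapshot_cfg;
    }
    return std::prev(it)->second;
}
```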
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
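The guard amounts to a simple comparison; a sketch, assuming the
snapshot message carries the commit index it covers (names are
illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A snapshot at or behind our commit index would reorder command
// application relative to what the local replica has already applied,
// so it must be rejected.
bool should_apply_remote_snapshot(uint64_t snapshot_commit_idx,
                                  uint64_t local_commit_idx) {
    return snapshot_commit_idx > local_commit_idx;
}
```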
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
We want to serialize snapshot application with command application;
otherwise a command may be applied after a snapshot that already
contains the result of its application (this is not necessarily a
problem, since Raft by itself does not guarantee apply-once semantics,
but it is better to prevent it when possible). This also moves all
interactions with the user's state machine into one place.
Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without a leadership transfer,
the cluster will be unavailable for at least the leader election
timeout.
We already have a mechanism to transfer leadership when an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If the node is still the leader after the timeout, the
operation fails.
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from the
C_old configuration to C_new, after the first configuration change
is committed, the second may fail with an exception saying that a
configuration change is in progress. The topology changes layer should
handle this exception; however, this may introduce either unpleasant
delays into cluster assembly (e.g. if we sleep before retrying), or
a busy-wait/thundering-herd situation, where all nodes keep
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
Isolate the checks for configuration transitions in a static function,
so they can be unit tested outside class server.
Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.
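A sketch of the split checks under assumed, simplified types (the real
code validates `raft::configuration` inside class server):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Assumed, simplified server entry: only the voter flag matters here.
struct server_entry {
    bool is_voter;
};

// Two separate checks so that each failure mode gets its own,
// user-friendly message.
void check_new_configuration(const std::vector<server_entry>& c_new) {
    if (c_new.empty()) {
        throw std::invalid_argument("new configuration is empty");
    }
    bool has_voter = false;
    for (const auto& s : c_new) {
        if (s.is_voter) {
            has_voter = true;
            break;
        }
    }
    if (!has_voter) {
        throw std::invalid_argument("new configuration has no voters");
    }
}
```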
*Allow* transferring leadership in a configuration where the only voter
is the leader itself. This is equivalent to syncing the leader's log
with the learner and then converting the leader itself to a follower.
This is safe, since the leader will quickly re-elect itself after an
election timeout, and it can be used to do a rolling restart of a
cluster with only one voter.
A test case follows.
Currently, if an append message cannot be sent to one of the followers,
the entire io_fiber blocks, which eventually stops replication. The
patch makes the message-sending part of io_fiber non-blocking. The
code adds a hash table that keeps track of the append_request sending
status per destination. All the remaining futures are waited for
during abort.
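The bookkeeping can be sketched like this, with `std::async`/`std::future`
standing in for seastar futures and all names hypothetical:

```cpp
#include <cassert>
#include <future>
#include <unordered_map>

using server_id = int;

// Hypothetical sketch: one outstanding send future per destination,
// kept in a hash table so the io_fiber never blocks on a slow peer.
// std::async(deferred) stands in for the real asynchronous RPC send.
struct append_sender {
    std::unordered_map<server_id, std::future<void>> in_flight;

    void send_append(server_id dst, int* delivered) {
        in_flight[dst] = std::async(std::launch::deferred,
                                    [delivered] { ++*delivered; });
    }

    // On abort, all remaining send futures are waited for before
    // tearing the server down.
    void abort() {
        for (auto& e : in_flight) {
            e.second.wait();
        }
        in_flight.clear();
    }
};
```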
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
Feature requests, fixes, and OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided
To simplify the code, for now only a single apply function can be set
per raft_cluster; no tests were using it in any other way. In the
future there could be custom apply functions assigned per server
dynamically, if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
Most Raft packets are sent very rarely, during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from
their send functions does not serve any purpose. Change raft's RPC
interface to return void for all packet types but append_request. We
still want a future from sending append_request for backpressure
purposes, since the replication protocol is more efficient if there is
no packet loss; it is better to pause the sender than to drop packets
inside the RPC. The RPC is still allowed to drop append_requests if
overloaded.
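The resulting interface shape might look like this (simplified
stand-ins; the real code uses seastar futures and the raft message
types):

```cpp
#include <cassert>
#include <future>

struct vote_request {};   // stand-in for the real raft message types
struct append_request {};

struct rpc {
    // Rare protocol messages are fire-and-forget: nothing useful to
    // wait for, and the transport may drop them.
    virtual void send_vote_request(int to, const vote_request&) = 0;
    // append_request keeps returning a future so that an overloaded
    // transport can pause the sender instead of dropping everything.
    virtual std::future<void> send_append_entries(
        int to, const append_request&) = 0;
    virtual ~rpc() = default;
};

// A trivial in-memory implementation counting the sends.
struct counting_rpc : rpc {
    int void_sends = 0;
    int append_sends = 0;
    void send_vote_request(int, const vote_request&) override {
        ++void_sends;
    }
    std::future<void> send_append_entries(
        int, const append_request&) override {
        ++append_sends;
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }
};
```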
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers.
Waiting on the index alone does not guarantee correct propagation of
the leader's log. This patch additionally checks the term of the
leader's last log entry.
This was exposed by occasional problems with packet drops.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently an entry is declared dropped only when an entry with a
different term is committed at the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.
Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
append a dummy entry which will be committed eventually and will
release one of the waiters on A, but if nothing else is submitted to B,
the 9 other waiters will wait forever.
The way to solve this is to drop all waiters that wait for a term
smaller than the one being committed. There is no chance they will ever
be committed, since terms in the log may only grow.
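A sketch of the cleanup rule under assumed, simplified structures:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified waiter: the log index waited on and the
// term of the corresponding entry.
struct waiter {
    uint64_t idx;
    uint64_t term;
};

// When an entry with `committed_term` commits, every waiter on a
// smaller term can be released as "dropped": log terms only grow, so
// its entry can never be committed.
std::vector<waiter> drop_stale_waiters(std::vector<waiter>& waiters,
                                       uint64_t committed_term) {
    std::vector<waiter> dropped;
    for (auto it = waiters.begin(); it != waiters.end();) {
        if (it->term < committed_term) {
            dropped.push_back(*it);
            it = waiters.erase(it);
        } else {
            ++it;
        }
    }
    return dropped;
}
```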
A snapshot transfer may take a lot of time, and meanwhile the leader
performing it may lose leadership. If that happens, the ongoing
snapshot transfer becomes obsolete, since the snapshot will be rejected
by the receiving node as coming from an old leader. Make snapshot
transfers abortable and abort them when the leader changes.
A leader may change while one of its followers is in snapshot transfer
mode, and that node may get an additional snapshot transfer request
from the new leader while the previous transfer is still not aborted.
Currently such a situation triggers an assert. This patch allows active
snapshot transfers from multiple nodes, but only one of them will
succeed in the end; all others will be replied to with 'fail'.
Use the new fine-grained connectivity control to manage old leader
disconnection more precisely.
This fixes elections where the vote of the old leader is required
for quorum. For example, with {A,B}, suppose we want to switch the
leader. For B to become a candidate it has to see A as down. Then A
has to see B's request for vote, and vote for B.
So in the general case the old leader needs to be disconnected from
all nodes first, then the desired node is made a candidate, and then
the old leader is connected only to the desired candidate (otherwise,
other nodes would see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, the patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate; the latter waits until a candidate either
becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to the original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
RPC configuration was updated only when an instance was
started with an initial snapshot.
If we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.
Fix that by moving the RPC setup code outside the check for
the snapshot id and looking at `_log.get_configuration()` instead.
Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.
For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
Stop using seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above the queue anyway; it
adds functionality that we do not need (EOS) and hides functionality
that we do need (being able to abort()). This fixes a crash during
abort where the pipe was used after being destroyed.
Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
If a leader is removed from a cluster, it will never learn whether the
entries that it has not yet committed ever become committed, so abort
the wait in this case with an uncertainty error.
The Raft instance needs to update the RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in the configuration, as well as dispose of the old nodes,
i.e. the nodes which are no longer part of the most recent
configuration.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
a majority of the "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be disposed of immediately.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce an rpc server_address that represents the
last observed state of address mappings
for the RPC module.
It does not correspond to any kind of configuration
in the raft sense; it is just an artificial construct
corresponding to the largest set of server
addresses coming from both the previous and current
raft configurations (so that both joining and
leaving servers can be contacted).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
By the time we receive a snapshot_reply from a follower
we may no longer be the leader. The follower's term may be
different from the snapshot term; e.g. the follower may
already be aware of a new leader and have a higher term.
We should pass this information into the (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call the FSM directly.
Raft's send_snapshot RPC is actually two-way: the follower
responds with a snapshot_reply message. Until now, however, this
message was muted by the RPC.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two-way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.
The change helps reduce protocol state managed outside FSM.
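The update itself is small; sketched here with hypothetical names for
the per-follower progress state:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified per-follower progress tracking.
struct follower_progress {
    uint64_t next_idx = 0;
    bool snapshot_mode = false;
};

void switch_to_snapshot(follower_progress& p, uint64_t snapshot_idx) {
    p.snapshot_mode = true;
    // Best guess for the next replication index if the transfer
    // succeeds; on failure the leader probes again and may resend a
    // newer snapshot.
    p.next_idx = snapshot_idx + 1;
}
```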
Prior to this fix there was an assert in
`raft::server_impl::start` checking that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, as described below.
The only place that depends on this corner case is in
`server_impl::io_fiber`. Whenever the term or vote changes,
both are set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist the term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check relies on the unobvious fact that the
term will never be 0 when `fsm::get_output` saves the
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so
checking only the term further obscures what is happening
and why.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In that case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output`, so that
a reader doesn't have to think about the motivation behind this `if`
and can simply check whether the `term_and_vote` optional is engaged.
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
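The shape of the change can be sketched as follows (simplified types;
the real `fsm_output` carries more fields):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

using term_t = uint64_t;
using server_id_t = uint64_t; // stand-in for the voted-for id

// Simplified fsm_output: an engaged optional explicitly says "persist
// term and vote", replacing the implicit `term != 0` convention.
struct fsm_output {
    std::optional<std::pair<term_t, server_id_t>> term_and_vote;
};

bool needs_to_persist_term_and_vote(const fsm_output& out) {
    return out.term_and_vote.has_value();
}
```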
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is how the PhD explains the need for the prevote stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
The prevote stage addresses that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state (in the same term).
In our implementation we have the "stable leader" extension that
prevents a spurious RequestVote from deposing an active leader, but an
AppendEntries with a higher term will still do that, so the prevote
extension is also required.
The old name was incorrect: in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
In certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.
For example, take servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it first has to make all 3
servers reach the election timeout threshold (to make B and C
receptive). Then A is ticked until it becomes a candidate and sends
vote requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
function. If one of the other servers, say B, gets its timer tick
before A sends its vote requests, B becomes a (disruptive) candidate
and will refuse to vote for A. In our case of only 3 out of 5 servers
connected, a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The patch adds a set of counters for various events inside the raft
implementation to facilitate monitoring and debugging.
Message-Id: <20210204125313.GA1513786@scylladb.com>
This patch set adds etcd unit tests for raft.
It also includes a fix for replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
The new naming scheme communicates more clearly to the client of
the raft library that the `persistence` interface implements the
persistence layer of the fsm powering the raft
protocol itself, rather than the client-side workflow and the
user-provided `state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
For tests to be able to transition to a consistent state, in some cases
it's necessary to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode in incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Combine structs for append request send and receive into a single
struct.
Author: Gleb Natapov <gleb@scylladb.com>
Date: Mon Nov 23 14:33:14 2020 +0200