This is how PhD explain the need for prevoting stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
Prevoting stage is addressing that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state(in the same term).
In our implementation we have "stable leader" extension that prevents
spurious RequestVote to dispose an active leader, but AppendEntries with
higher term will still do that, so prevoting extension is also required.
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
For certain situations where barely enough nodes to elect a new leader
are connected a disruptive candidate can occassionally block the
election.
For example having servers A B C D E and only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This patch set adds etcd unit tests for raft.
It also includes a fix for replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode for incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
To prevent the log to take too much memory introduce a mechanism that
limits the log to a certain size. If the size is reached no new log
entries can be submitted until previous entries are committed and
snapshotted.
This patch allows to leave snapshot_trailing amount of entries
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
The patch implements periodic taking of a snapshot and trimming of
the raft log.
In raft the only way the log of already committed entries can be shorten
is by taking a snapshot of the state machine and dropping log entries
included in the snapshot from the raft log. To not let log to grow too
large the patch takes the snapshot periodically after applying N number
of entries where N can be configured by setting snapshot_threshold
value in raft's configuration.
For convenience making Raft tests, use declarative structures.
Servers are set up and initialized and then updates are processed.
For now, updates are just adding entries to leader and change of leader.
Updates and leader changes can be specified to run after initial test setup.
An example test for 3 nodes, node 0 starting as leader having two entries
0 and 1 for term 1, and with current term 2, then adding 12 entries,
changing leader to node 1, and adding 12 more entries. The test will
automatically add more entries to the last leader until the test limit
of total_values (default 100).
{.name = "test_name", .nodes = 3, .initial_term = 2,
.initial_states = {{.le = {{1,0},{1,1}}},
.updates = {entries{12},new_leader{1},entries{12}},},
Leader is isolated before change via is_leader returning false.
Initial leader (default server 0) will be set with this method, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Send more that one entry in single append_entry message but
limit one packets size according to append_request_threshold parameter.
Message-Id: <20201007142602.GA2496906@scylladb.com>
This commit introduce public raft interfaces. raft::server represents
single raft server instance. raft::state_machine represents a user
defined state machine. raft::rpc, raft::rpc_client and raft::storage are
used to allow implementing custom networking and storage layers.
A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.