Commit Graph

18 Commits

Author SHA1 Message Date
Gleb Natapov
1f868d516e raft: implement prevoting stage in leader election
This is how the Raft PhD dissertation explains the need for a prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  The prevoting stage addresses that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have a "stable leader" extension that prevents
a spurious RequestVote from deposing an active leader, but AppendEntries
with a higher term will still do that, so the prevoting extension is
also required.
2021-03-12 11:09:21 +02:00
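The prevote rule quoted in commit 1f868d516e can be illustrated with a minimal sketch. All types and names here (voter, log_position, would_grant_prevote, try_start_election, election_timeout_ticks) are hypothetical simplifications and are not taken from the raft library; the point is only that the candidate's term stays unchanged unless a majority would grant a vote.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types; not the raft library's interfaces.
struct log_position {
    uint64_t last_term;
    uint64_t last_index;
};

struct voter {
    uint64_t current_term;
    log_position log;
    unsigned ticks_since_heartbeat; // ticks since a valid leader was heard
};

constexpr unsigned election_timeout_ticks = 10; // assumed baseline timeout

// A voter would grant a prevote only if it has not heard from a leader for
// at least the baseline election timeout, the proposed term is newer than
// its own, and the candidate's log is at least as up-to-date as its own.
// Answering a prevote request changes no state on the voter.
bool would_grant_prevote(const voter& v, uint64_t proposed_term,
                         log_position candidate_log) {
    if (v.ticks_since_heartbeat < election_timeout_ticks) {
        return false;
    }
    if (proposed_term <= v.current_term) {
        return false;
    }
    if (candidate_log.last_term != v.log.last_term) {
        return candidate_log.last_term > v.log.last_term;
    }
    return candidate_log.last_index >= v.log.last_index;
}

// The candidate increments its term only after a majority of the whole
// cluster (itself included) says it would vote for it. reachable_peers
// holds only the servers the candidate can currently talk to.
bool try_start_election(uint64_t& candidate_term, log_position candidate_log,
                        const std::vector<voter>& reachable_peers,
                        std::size_t cluster_size) {
    std::size_t granted = 1; // the candidate's own prevote
    for (const auto& peer : reachable_peers) {
        if (would_grant_prevote(peer, candidate_term + 1, candidate_log)) {
            ++granted;
        }
    }
    if (granted * 2 <= cluster_size) {
        return false; // no majority: the term is left untouched
    }
    ++candidate_term;
    return true;
}
```

In this sketch a partitioned server (no reachable peers) and a rejoining server (peers that recently received heartbeats) both fail the prevote and keep their term, which is the behaviour the quoted text describes.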
Konstantin Osipov
95ee8e1b90 raft: fix spelling
Fix spelling of a few comments.
2021-02-19 22:56:26 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect: if apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), which is a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Alejo Sanchez
b41a6822e8 raft: drop ticker from raft
Remove ticker callbacks from raft::server.
External code should periodically call raft::server::tick().

Update replication_test accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:41:42 -04:00
Alejo Sanchez
97338ab53f raft: replication test: fix debug mode hangs
In certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.

For example, consider servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it first has to make all 3
servers reach the election timeout threshold (to make B and C receptive).
Then A is ticked until it becomes a candidate and sends vote
requests to the other servers.

But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case, with only 3 out of 5 servers
connected, a single missing vote can hang the election.

This patch disables timer ticks for all servers when running custom
elections and partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-11 11:42:31 -04:00
Konstantin Osipov
c7b5a60320 raft: joint consensus, wire up configuration changes in the API
Now that we've implemented joint consensus based configuration changes,
replace add_server()/remove_server() with a more general set_configuration().
2021-01-29 22:07:08 +03:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements the
persistence layer of the fsm that powers the raft
protocol itself, rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to transition into a consistent state, in some cases
the followers need to be allowed to catch up with the leader.

This prevents occasional hangs in debug mode for upcoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
d610d5a7b8 raft: expose fsm tick() to server for testing
For tests to advance servers they need to invoke tick().

This is needed to advance free elections.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
9e7e14fc50 raft: expose is_leader() for testing
Expose fsm leader check to allow tests to find out the leader after an
election.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Gleb Natapov
8d9b6f588e raft: stop accepting requests on a leader after the log reaches the limit
To prevent the log from taking too much memory, introduce a mechanism
that limits the log to a certain size. Once the limit is reached, no new
log entries can be submitted until previous entries are committed and
snapshotted.
2020-11-18 19:14:37 +01:00
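The size limit described in commit 8d9b6f588e can be pictured with a small sketch. The class bounded_log and its members are hypothetical and only model the idea; the sketch rejects an append when the limit is reached, whereas an implementation could just as well make the caller wait.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical model of a leader log with a size limit.
class bounded_log {
    std::deque<std::string> _entries;
    std::size_t _max_log_size;
public:
    explicit bounded_log(std::size_t max_log_size)
        : _max_log_size(max_log_size) {}

    // Returns false when the log is full. The caller has to retry after
    // older entries have been committed and snapshotted.
    bool try_append(std::string entry) {
        if (_entries.size() >= _max_log_size) {
            return false;
        }
        _entries.push_back(std::move(entry));
        return true;
    }

    // Drop the first n entries, modelling entries that were committed and
    // are now covered by a snapshot.
    void snapshot_prefix(std::size_t n) {
        n = std::min(n, _entries.size());
        _entries.erase(_entries.begin(),
                       _entries.begin() + static_cast<std::ptrdiff_t>(n));
    }

    std::size_t size() const { return _entries.size(); }
};
```

Once snapshot_prefix() frees space, try_append() succeeds again, which mirrors the "no new entries until previous entries are committed and snapshotted" rule.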
Gleb Natapov
8bab38c6fa raft: use correct type for node info in add_server()
2020-11-06 17:06:07 +03:00
Gleb Natapov
7fdfa32dbd raft: preserve trailing raft log entries during snapshotting
This patch allows keeping snapshot_trailing entries in the log
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up slow nodes without
requiring a snapshot transfer. The value is part of the configuration
and can be changed.
2020-10-15 11:50:27 +03:00
Gleb Natapov
7c1187b7f5 raft: implement periodic snapshotting of a state machine
The patch implements periodic taking of a snapshot and trimming of
the raft log.

In raft, the only way the log of already committed entries can be
shortened is by taking a snapshot of the state machine and dropping the
log entries included in the snapshot from the raft log. To keep the log
from growing too large, the patch takes a snapshot periodically after
applying N entries, where N can be configured by setting the
snapshot_threshold value in raft's configuration.
2020-10-15 11:48:44 +03:00
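How snapshot_threshold (commit 7c1187b7f5) and snapshot_trailing (commit 7fdfa32dbd) interact can be shown with a simplified model. The names snapshot_config, snapshotting_log, append_and_apply and take_snapshot are assumptions made for illustration, and the model treats every appended entry as committed and applied immediately, which a real log does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical configuration carrying the two values named in the commits.
struct snapshot_config {
    std::size_t snapshot_threshold; // snapshot after this many applied entries
    std::size_t snapshot_trailing;  // entries kept in the log after a snapshot
};

// Simplified model: only entry indexes are stored.
class snapshotting_log {
    std::deque<uint64_t> _entries;  // indexes of entries held in memory
    uint64_t _snapshot_index = 0;   // last index covered by the snapshot
    uint64_t _applied = 0;          // last applied index
    uint64_t _next_index = 1;
    snapshot_config _cfg;
public:
    explicit snapshotting_log(snapshot_config cfg) : _cfg(cfg) {}

    // Append one entry and apply it right away (simplifying assumption).
    void append_and_apply() {
        _entries.push_back(_next_index++);
        _applied = _entries.back();
        if (_applied - _snapshot_index >= _cfg.snapshot_threshold) {
            take_snapshot();
        }
    }

    // Snapshot everything applied so far, then trim the log but keep up to
    // snapshot_trailing entries so that slow followers can still catch up
    // from the log instead of needing a snapshot transfer.
    void take_snapshot() {
        _snapshot_index = _applied;
        while (_entries.size() > _cfg.snapshot_trailing) {
            _entries.pop_front();
        }
    }

    std::size_t in_memory_size() const { return _entries.size(); }
    uint64_t snapshot_index() const { return _snapshot_index; }
};
```

With a threshold of 4 and 2 trailing entries, the model snapshots at index 4 and keeps entries 3 and 4 in memory, so the in-memory size is larger than the number of entries not yet covered by a snapshot.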
Alejo Sanchez
670824c6fa raft: declarative tests
To make writing Raft tests more convenient, use declarative structures.

Servers are set up and initialized and then updates are processed.
For now, updates are limited to adding entries to the leader and changing the leader.

Updates and leader changes can be specified to run after initial test setup.

An example test with 3 nodes: node 0 starts as leader with two entries,
0 and 1, for term 1 and with current term 2; the test then adds 12
entries, changes the leader to node 1, and adds 12 more entries. The test
will automatically add more entries to the last leader until the test
limit of total_values (default 100) is reached.

    {.name = "test_name", .nodes = 3, .initial_term = 2,
    .initial_states = {{.le = {{1,0},{1,1}}}},
    .updates = {entries{12},new_leader{1},entries{12}},},

Leader is isolated before change via is_leader returning false.
Initial leader (default server 0) will be set with this method, too.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:31 +02:00
Gleb Natapov
0bff15a976 raft: Send multiple entries in one append_entry rpc
Send more than one entry in a single append_entry message, but
limit the size of each packet according to the append_request_threshold parameter.

Message-Id: <20201007142602.GA2496906@scylladb.com>
2020-10-07 16:43:33 +02:00
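The batching idea from commit 0bff15a976 might look roughly like the sketch below. The function make_append_batches is hypothetical, entries are modelled as plain strings, and the policy of sending an oversized entry alone is an assumption, not something the commit message states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Group entries into batches whose combined payload size stays within
// append_request_threshold. An entry that is larger than the threshold on
// its own still gets sent, alone in its batch (assumed policy).
std::vector<std::vector<std::string>> make_append_batches(
        const std::vector<std::string>& entries,
        std::size_t append_request_threshold) {
    std::vector<std::vector<std::string>> batches;
    std::vector<std::string> current;
    std::size_t current_size = 0;
    for (const auto& entry : entries) {
        // Close the current batch if adding this entry would exceed the limit.
        if (!current.empty() &&
            current_size + entry.size() > append_request_threshold) {
            batches.push_back(std::move(current));
            current.clear();
            current_size = 0;
        }
        current.push_back(entry);
        current_size += entry.size();
    }
    if (!current.empty()) {
        batches.push_back(std::move(current));
    }
    return batches;
}
```

Each inner vector corresponds to the entries carried by one append message in this model.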
Gleb Natapov
c073997431 raft: Introduce raft interface header
This commit introduces the public raft interfaces. raft::server represents
a single raft server instance. raft::state_machine represents a
user-defined state machine. raft::rpc, raft::rpc_client and raft::storage
are used to allow implementing custom networking and storage layers.

A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
2020-10-01 14:30:59 +03:00