scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 11:00:35 +00:00

Author	SHA1	Message	Date
Gleb Natapov	1f868d516e	raft: implement prevoting stage in leader election This is how the Raft PhD thesis explains the need for a prevoting stage: One downside of Raft's leader election algorithm is that a server that has been partitioned from the cluster is likely to cause a disruption when it regains connectivity. When a server is partitioned, it will not receive heartbeats. It will soon increment its term to start an election, although it won't be able to collect enough votes to become leader. When the server regains connectivity sometime later, its larger term number will propagate to the rest of the cluster (either through the server's RequestVote requests or through its AppendEntries response). This will force the cluster leader to step down, and a new election will have to take place to select a new leader. The prevoting stage addresses that. In the Prevote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes (if the candidate's log is sufficiently up-to-date, and the voters have not received heartbeats from a valid leader for at least a baseline election timeout). The Prevote algorithm solves the issue of a partitioned server disrupting the cluster when it rejoins. While a server is partitioned, it won't be able to increment its term, since it can't receive permission from a majority of the cluster. Then, when it rejoins the cluster, it still won't be able to increment its term, since the other servers will have been receiving regular heartbeats from the leader. Once the server receives a heartbeat from the leader itself, it will return to the follower state (in the same term). In our implementation we have a "stable leader" extension that prevents a spurious RequestVote from deposing an active leader, but an AppendEntries with a higher term will still do that, so the prevoting extension is also required.	2021-03-12 11:09:21 +02:00
Konstantin Osipov	95ee8e1b90	raft: fix spelling Fix spelling of a few comments.	2021-02-19 22:56:26 +03:00
Konstantin Osipov	51c968bcb4	raft: rename log::non_snapshoted_length() to log::in_memory_size() The old name was incorrect: if apply_snapshot() was called with non-zero trailing entries, the total log length is greater than the length of the part that is not stored in a snapshot. Fix spelling in related comments. Rename fsm::wait() to fsm::wait_max_log_size(), which is a more specific name. Rename max_log_length to max_log_size to use 'size' rather than 'length' consistently for log size.	2021-02-18 16:04:44 +03:00
Alejo Sanchez	b41a6822e8	raft: drop ticker from raft Remove ticker callbacks from raft::server. External code should periodically call raft::server::tick(). Update replication_test accordingly. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-14 09:41:42 -04:00
Alejo Sanchez	97338ab53f	raft: replication test: fix debug mode hangs In certain situations where barely enough nodes to elect a new leader are connected, a disruptive candidate can occasionally block the election. For example, take servers A B C D E where only A B C are active in a partition. If the test wants to elect A, it first has to make all 3 servers reach the election timeout threshold (to make B and C receptive). Then A is ticked until it becomes a candidate and has to send vote requests to the other servers. But all servers have a timer (_ticker) calling their periodic tick() functions. If one of the other servers, say B, gets its timer tick before A sends vote requests, B becomes a (disruptive) candidate and will refuse to vote for A. In our case, with only 3 out of 5 servers connected, a single missing vote can hang the election. This patch disables timer ticks for all servers when running custom elections and partitioning. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-11 11:42:31 -04:00
Konstantin Osipov	c7b5a60320	raft: joint consensus, wire up configuration changes in the API Now that we've implemented joint consensus based configuration changes, replace add_server()/remove_server() with a more general set_configuration().	2021-01-29 22:07:08 +03:00
Tomasz Grabiec	f08a3e3fd8	Merge "raft: test fixes, etcd tests, simplification" from Alejo This patch set adds etcd unit tests for raft. It also includes a fix for replication test in debug mode and a simplification for append_request. Tests: unit ({dev}), unit ({debug}), unit ({release}) * https://github.com/alecco/scylla/tree/raft-ale-tests-09b: raft: etcd unit tests: test log replication raft: boost test etcd: test fsm can vote from any state raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs raft: replication test: add etcd test for cycling leaders raft: testing: provide primitives to wait for log propagation raft: etcd unit tests: initial boost tests raft: combine append_request _receive and _send	2021-01-21 10:41:33 +02:00
Pavel Solodovnikov	041072b59f	raft: rename `storage` to `persistence` The new naming scheme more clearly communicates to the client of the raft library that the `persistence` interface implements the persistence layer of the fsm that powers the raft protocol itself, rather than the client-side workflow and the user-provided `state_machine`. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>	2021-01-20 10:23:43 +02:00
Alejo Sanchez	f627972186	raft: testing: provide primitives to wait for log propagation For tests to be able to transition into a consistent state, in some cases the followers must be allowed to catch up with the leader. This prevents occasional hangs in debug mode for incoming tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-18 12:33:37 -04:00
Alejo Sanchez	d610d5a7b8	raft: expose fsm tick() to server for testing For tests to advance servers they need to invoke tick(). This is needed to advance free elections. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-11-22 10:32:34 -04:00
Alejo Sanchez	9e7e14fc50	raft: expose is_leader() for testing Expose fsm leader check to allow tests to find out the leader after an election. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-11-22 10:32:34 -04:00
Gleb Natapov	8d9b6f588e	raft: stop accepting requests on a leader after the log reaches the limit To prevent the log from taking too much memory, introduce a mechanism that limits the log to a certain size. If the size is reached, no new log entries can be submitted until previous entries are committed and snapshotted.	2020-11-18 19:14:37 +01:00
Gleb Natapov	8bab38c6fa	raft: use correct type for node info in add_server()	2020-11-06 17:06:07 +03:00
Gleb Natapov	7fdfa32dbd	raft: preserve trailing raft log entries during snapshotting This patch allows snapshot_trailing entries to be left in the log when a state machine is snapshotted and raft log entries are dropped. Those entries can be used to catch up slow nodes without requiring a snapshot transfer. The value is part of the configuration and can be changed.	2020-10-15 11:50:27 +03:00
Gleb Natapov	7c1187b7f5	raft: implement periodic snapshotting of a state machine The patch implements periodic taking of a snapshot and trimming of the raft log. In raft, the only way the log of already committed entries can be shortened is by taking a snapshot of the state machine and dropping the log entries included in the snapshot from the raft log. To keep the log from growing too large, the patch takes the snapshot periodically after applying N entries, where N can be configured by setting the snapshot_threshold value in raft's configuration.	2020-10-15 11:48:44 +03:00
Alejo Sanchez	670824c6fa	raft: declarative tests For convenience when writing Raft tests, use declarative structures. Servers are set up and initialized and then updates are processed. For now, updates are just adding entries to the leader and changing the leader. Updates and leader changes can be specified to run after initial test setup. An example test for 3 nodes, node 0 starting as leader having two entries 0 and 1 for term 1, and with current term 2, then adding 12 entries, changing leader to node 1, and adding 12 more entries. The test will automatically add more entries to the last leader until the test limit of total_values (default 100). {.name = "test_name", .nodes = 3, .initial_term = 2, .initial_states = {{.le = {{1,0},{1,1}}}}, .updates = {entries{12},new_leader{1},entries{12}},}, The leader is isolated before the change by making is_leader return false. The initial leader (default server 0) will be set with this method, too. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-10-09 15:50:31 +02:00
Gleb Natapov	0bff15a976	raft: Send multiple entries in one append_entry rpc Send more than one entry in a single append_entry message, but limit the size of one packet according to the append_request_threshold parameter. Message-Id: <20201007142602.GA2496906@scylladb.com>	2020-10-07 16:43:33 +02:00
Gleb Natapov	c073997431	raft: Introduce raft interface header This commit introduces the public raft interfaces. raft::server represents a single raft server instance. raft::state_machine represents a user-defined state machine. raft::rpc, raft::rpc_client and raft::storage are used to allow implementing custom networking and storage layers. A shared failure detector interface defines keep-alive semantics, required for efficient implementation of thousands of raft groups.	2020-10-01 14:30:59 +03:00
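
The log-trimming behaviour described in commits 7c1187b7f5, 7fdfa32dbd, 8d9b6f588e and 51c968bcb4 can be illustrated with a minimal sketch. It is not the ScyllaDB code: `toy_log`, `snapshot_config` and the free functions are hypothetical names invented here, and only `snapshot_threshold`, `snapshot_trailing`, `max_log_size` and `in_memory_size()` are taken from the commit messages. The sketch assumes log indexes start at 1 and models the log by its index bounds only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical configuration; field names follow the commit messages.
struct snapshot_config {
    std::size_t snapshot_threshold; // snapshot after this many applied entries
    std::size_t snapshot_trailing;  // entries kept in the log after a snapshot
    std::size_t max_log_size;       // stop accepting entries at this size
};

// Hypothetical model of a log, tracking index bounds only.
struct toy_log {
    std::uint64_t first_idx = 1;    // oldest entry still held in memory
    std::uint64_t last_idx = 0;     // newest entry
    std::uint64_t snapshot_idx = 0; // last entry covered by the snapshot
    std::uint64_t applied_idx = 0;  // last entry applied to the state machine

    // Counts every entry held in memory, including trailing entries that
    // are already covered by the snapshot.
    std::size_t in_memory_size() const {
        return static_cast<std::size_t>(last_idx + 1 - first_idx);
    }
};

// A leader refuses new entries once the in-memory log reaches the limit.
bool can_accept_entry(const toy_log& log, const snapshot_config& cfg) {
    return log.in_memory_size() < cfg.max_log_size;
}

// A snapshot is due after snapshot_threshold entries were applied since
// the previous snapshot.
bool should_snapshot(const toy_log& log, const snapshot_config& cfg) {
    assert(log.applied_idx >= log.snapshot_idx);
    return log.applied_idx - log.snapshot_idx >= cfg.snapshot_threshold;
}

// Snapshot up to applied_idx and drop covered entries, except for the
// last snapshot_trailing of them.
void take_snapshot(toy_log& log, const snapshot_config& cfg) {
    log.snapshot_idx = log.applied_idx;
    std::uint64_t keep_from = 1;
    if (log.snapshot_idx > cfg.snapshot_trailing) {
        keep_from = log.snapshot_idx - cfg.snapshot_trailing + 1;
    }
    log.first_idx = std::max(log.first_idx, keep_from);
}
```

With a non-zero `snapshot_trailing`, `in_memory_size()` is larger than the number of entries not yet covered by a snapshot, which is the distinction the rename in 51c968bcb4 is about.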

18 Commits
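
The prevote rule quoted in commit 1f868d516e can be sketched as follows. This is a simplified illustration, not the implementation from that commit: the `voter` and `candidate` structs and both functions are hypothetical, and the sketch assumes a candidate counts its own vote and that "sufficiently up-to-date" means the usual Raft comparison of last log term, then last log index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of another server as seen by the candidate.
struct voter {
    std::uint64_t last_log_term;
    std::uint64_t last_log_index;
    bool heard_from_leader_recently; // heartbeat within the election timeout
    bool reachable;                  // false while partitioned away
};

// Hypothetical candidate state.
struct candidate {
    std::uint64_t current_term;
    std::uint64_t last_log_term;
    std::uint64_t last_log_index;
};

// A voter is willing to grant its vote only if it has no live leader and
// the candidate's log is at least as up to date as its own.
bool would_grant_prevote(const voter& v, const candidate& c) {
    if (!v.reachable || v.heard_from_leader_recently) {
        return false;
    }
    if (c.last_log_term != v.last_log_term) {
        return c.last_log_term > v.last_log_term;
    }
    return c.last_log_index >= v.last_log_index;
}

// Runs one prevote round. The term is incremented only if a majority of
// the whole cluster (candidate included) would grant its vote.
// Returns true when the candidate may start a real election.
bool prevote_and_maybe_bump_term(candidate& c, const std::vector<voter>& others) {
    std::size_t granted = 1; // the candidate's own vote
    for (const auto& v : others) {
        if (would_grant_prevote(v, c)) {
            ++granted;
        }
    }
    const std::size_t cluster_size = others.size() + 1;
    assert(cluster_size >= 1);
    if (granted * 2 > cluster_size) {
        ++c.current_term;
        return true;
    }
    return false;
}
```

In this model a partitioned server sees every peer as unreachable and a rejoining server sees peers that still hear the leader, so in both cases its term stays unchanged and the active leader is not forced to step down.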