scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 05:53:13 +00:00

Author	SHA1	Message	Date
Gleb Natapov	9d6bf7f351	raft: introduce leader stepdown procedure Section 3.10 of the PhD describes two cases for which the extension can be helpful: 1. Sometimes the leader must step down. For example, it may need to reboot for maintenance, or it may be removed from the cluster. When it steps down, the cluster will be idle for an election timeout until another server times out and wins an election. This brief unavailability can be avoided by having the leader transfer its leadership to another server before it steps down. 2. In some cases, one or more servers may be more suitable to lead the cluster than others. For example, a server with high load would not make a good leader, or in a WAN deployment, servers in a primary datacenter may be preferred in order to minimize the latency between clients and the leader. Other consensus algorithms may be able to accommodate these preferences during leader election, but Raft needs a server with a sufficiently up-to-date log to become leader, which might not be the most preferred one. Instead, a leader in Raft can periodically check to see whether one of its available followers would be more suitable, and if so, transfer its leadership to that server. (If only human leaders were so graceful.) The patch here implements the extension and employs it automatically when a leader removes itself from a cluster.	2021-03-22 10:28:43 +02:00
Gleb Natapov	888b52dea1	raft: fix replication when leader is not part of current config When a leader orchestrates its own removal from a cluster there is a situation where the leader is still responsible for replication, but it is no longer part of active configuration. Current code skips replication in this case though. Fix it by always replicating in the leader state.	2021-03-22 09:52:17 +02:00
Gleb Natapov	1acc8996bc	raft: do not update last election time if current leader is not a part of current configuration Since we use an external failure detector instead of relying on empty AppendRequests from a leader, there can be a situation where a node is no longer part of a certain raft group but is still alive (and may also be part of other raft groups). In such a case the last election time should not be updated even if the node is alive. This is the same as if the node had stopped sending empty AppendRequests in original raft.	2021-03-22 09:52:17 +02:00
Gleb Natapov	ccf4435759	raft: move log limiting semaphore into the leader state Log limiting semaphore is used on a leader only, so it should be stored inside the leader state.	2021-03-22 09:52:17 +02:00
Konstantin Osipov	fcc6e621f8	raft: pass snapshot_reply into fsm::step() By the time we receive snapshot_reply from a follower we may no longer be the leader. Follower term may be different from snapshot term, e.g. the follower may be aware of a new leader already and have a higher term. We should pass this information into (possibly ex-) leader FSM via fsm::step() so that it can correctly change its state, and not call FSM directly.	2021-03-18 16:56:46 +03:00
Konstantin Osipov	4afa662d62	raft: respond with snapshot_reply to send_snapshot RPC Raft send_snapshot RPC is actually two-way, the follower responds with snapshot_reply message. This message until now was, however, muted by RPC. Do not mute snapshot_reply any more: - to make it obvious the RPC is two way - to feed the follower response directly into leader's FSM and thus ensure that FSM testing results produced when using a test transport are representative of the real world uses of raft::rpc.	2021-03-18 16:56:42 +03:00
Konstantin Osipov	cb3314d756	raft: set follower's next_idx when switching to SNAPSHOT mode Set follower's next_idx to snapshot index + 1 when switching it to snapshot mode. If snapshot transfer succeeds, that's the best match for the follower's next replication index. If it fails, the leader will send a new probe to find out the follower position again and re-try sending a possibly newer snapshot. The change helps reduce protocol state managed outside FSM.	2021-03-18 16:35:11 +03:00
Konstantin Osipov	66c729da66	raft: set the current leader upon getting InstallSnapshot If the current leader is set, the follower will not vote for another candidate. This is also known as "sticky leadership" rule. Before this change, the rule was enacted only upon receiving AppendEntries RPC from the leader. Turn it on also upon receiving InstallSnapshot RPC.	2021-03-18 08:36:57 +03:00
Gleb Natapov	32d386d0d8	raft: fix use after free during logging in append_entries_reply() As the existing comment explains a progress can be deleted at the point of logging. The logging should only be done if the progress still exists. Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>	2021-03-17 09:59:22 +02:00
Pavel Solodovnikov	93c565a1bf	raft: allow raft server to start with initial term 0 Prior to the fix there was an assert to check in `raft::server_impl::start` that the initial term is not 0. This restriction is completely artificial and can be lifted without any problems, which will be described below. The only place that is dependent on this corner case is in `server_impl::io_fiber`. Whenever term or vote has changed, they will be both set in `fsm::get_output`. `io_fiber` checks whether it needs to persist term and vote by validating that the term field is set (by actually executing a `term != 0` condition). This particular check is based on a non-obvious fact that the term will never be 0 in case `fsm::get_output` saves term and vote values, indicating that they need to be persisted. Vote and term can change independently of each other, so checking only for term obscures even more what is happening and why. In either case term will never be 0, because: 1. If the term has changed, then it's naturally greater than 0, since it's a monotonically increasing value. 2. If the vote has changed, it means that we received a vote request message. In such case we have already updated our term to the requester's term. Switch to using an explicit optional in `fsm_output` so that a reader doesn't have to think about the motivation behind this `if` and can just check that the `term_and_vote` optional is engaged. Given the motivation described above, the corresponding assert(_fsm->get_current_term() != term_t(0)); in `server_impl::start` is removed. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
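The switch from a `term != 0` check to an engaged optional, described in the commit above, could look roughly like the following. This is a minimal sketch, not the code from the patch: `fsm_output_sketch`, `persisted_state` and `persist_if_needed` are hypothetical names, and only the `term_and_vote` field name comes from the commit message.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical stand-in for the FSM output; only `term_and_vote` is named in the commit.
struct fsm_output_sketch {
    // Engaged only when the term or the vote changed and must be persisted.
    std::optional<std::pair<std::uint64_t, std::uint64_t>> term_and_vote;
};

// Hypothetical stand-in for durable storage; `writes` counts persist operations.
struct persisted_state {
    std::uint64_t term = 0;
    std::uint64_t vote = 0;
    int writes = 0;
};

// Persist term and vote only if the optional is engaged.
// No reasoning about whether the term can be 0 is needed here.
void persist_if_needed(const fsm_output_sketch& out, persisted_state& st) {
    if (out.term_and_vote) {
        st.term = out.term_and_vote->first;
        st.vote = out.term_and_vote->second;
        ++st.writes;
    }
}
```

With this shape the decision to persist no longer depends on the numeric value of the term, which is what lets the assert on a non-zero initial term go away.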
Gleb Natapov	e231186a7b	raft: store leader and candidate state in state variant We already have server state dependent state in fsm, so there is no need to maintain "voters" and "tracker" optionals as well. The upside is that optional and variant states cannot drift apart now.	2021-03-12 11:12:57 +02:00
Gleb Natapov	e17e7d57bd	raft: add boost tests for prevoting	2021-03-12 11:12:57 +02:00
Gleb Natapov	1f868d516e	raft: implement prevoting stage in leader election This is how the PhD explains the need for a prevoting stage: One downside of Raft's leader election algorithm is that a server that has been partitioned from the cluster is likely to cause a disruption when it regains connectivity. When a server is partitioned, it will not receive heartbeats. It will soon increment its term to start an election, although it won't be able to collect enough votes to become leader. When the server regains connectivity sometime later, its larger term number will propagate to the rest of the cluster (either through the server's RequestVote requests or through its AppendEntries response). This will force the cluster leader to step down, and a new election will have to take place to select a new leader. The prevoting stage addresses that. In the Prevote algorithm, a candidate only increments its term if it first learns from a majority of the cluster that they would be willing to grant the candidate their votes (if the candidate's log is sufficiently up-to-date, and the voters have not received heartbeats from a valid leader for at least a baseline election timeout). The Prevote algorithm solves the issue of a partitioned server disrupting the cluster when it rejoins. While a server is partitioned, it won't be able to increment its term, since it can't receive permission from a majority of the cluster. Then, when it rejoins the cluster, it still won't be able to increment its term, since the other servers will have been receiving regular heartbeats from the leader. Once the server receives a heartbeat from the leader itself, it will return to the follower state (in the same term). In our implementation we have a "stable leader" extension that prevents a spurious RequestVote from deposing an active leader, but AppendEntries with a higher term will still do that, so the prevoting extension is also required.	2021-03-12 11:09:21 +02:00
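The voter-side condition quoted in the commit above (the candidate's log is sufficiently up-to-date, and no heartbeat has arrived from a valid leader for at least an election timeout) can be sketched as a single predicate. This is an illustration only: `prevote_request`, `voter_view` and `would_grant_prevote` are hypothetical names rather than identifiers from the patch, and the term comparison is an assumption borrowed from the ordinary RequestVote rule.

```cpp
#include <cstdint>

// Hypothetical pre-vote request: the candidate has NOT incremented its own term yet.
struct prevote_request {
    std::uint64_t proposed_term;  // candidate's current term + 1
    std::uint64_t last_log_term;
    std::uint64_t last_log_idx;
};

// Hypothetical view of the voter's local state.
struct voter_view {
    std::uint64_t current_term;
    std::uint64_t last_log_term;
    std::uint64_t last_log_idx;
    std::uint64_t ticks_since_leader_contact;
    std::uint64_t election_timeout_ticks;
};

// Returns true if the voter would be willing to vote for the candidate.
// Granting a pre-vote changes no state on the voter.
bool would_grant_prevote(const voter_view& v, const prevote_request& r) {
    // The voter has not heard from a valid leader for an election timeout.
    const bool leader_seems_dead =
        v.ticks_since_leader_contact >= v.election_timeout_ticks;
    // Standard Raft "at least as up-to-date" log comparison.
    const bool log_ok =
        r.last_log_term > v.last_log_term ||
        (r.last_log_term == v.last_log_term && r.last_log_idx >= v.last_log_idx);
    // Assumption: a stale proposed term is rejected, as with RequestVote.
    const bool term_ok = r.proposed_term >= v.current_term;
    return leader_seems_dead && log_ok && term_ok;
}
```

A server that rejoins after a partition fails the first check on every voter that is still receiving heartbeats, so it never gets to increment its term.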
Gleb Natapov	a849246cfc	raft: reset the leader on entering candidate state Not resetting the leader causes vote requests to be ignored instead of rejected, which makes a voting round take longer to fail and may slow down the election of a new leader.	2021-03-11 10:36:43 +02:00
Gleb Natapov	20d6bb36cd	raft: use modern unordered_set::contains instead of find in become_candidate	2021-03-11 10:36:43 +02:00
Gleb Natapov	dd6ba3d507	raft: add non-voting member support This patch adds support for non-voting members. A non-voting member is a member whose vote is not counted for leader election or commit index calculation purposes, and which cannot become a leader. Otherwise it is a normal raft node. The state is needed to let new nodes catch up their log without disturbing a cluster. All kinds of transitions are allowed. A node may be added as a voting member directly, or it may be added as non-voting and then changed to a voting one through an additional configuration change. A node can be demoted from voting to non-voting member through a configuration change as well. Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>	2021-03-09 13:47:48 +01:00
Konstantin Osipov	95ee8e1b90	raft: fix spelling Fix spelling of a few comments.	2021-02-19 22:56:26 +03:00
Konstantin Osipov	e49d5f89a5	raft: do not account for the same vote twice While a conforming Raft implementation never sends a duplicate vote from the same server, Raft's assumptions about the network permit duplicates. So, in theory, it is possible that a vote message is delivered multiple times. The current voting implementation does reject votes from non-members, but doesn't check for duplicate votes. Keep track of who has already voted, and reject duplicate votes. A unit test follows.	2021-02-18 16:04:44 +03:00
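The bookkeeping described in the commit above (reject votes from non-members and count each member's vote at most once) can be sketched as follows. This is a minimal illustration under assumed names: `vote_tally` and `register_vote` are hypothetical and are not taken from the patch, and server ids are plain strings for brevity.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical vote counter for a single, non-joint configuration.
class vote_tally {
    std::unordered_set<std::string> _voters;     // members allowed to vote
    std::unordered_set<std::string> _responded;  // members whose vote was already counted
    std::size_t _granted = 0;
public:
    explicit vote_tally(std::unordered_set<std::string> voters)
        : _voters(std::move(voters)) {}

    // Returns true if the vote was counted, false if it was ignored
    // (sender is not a member, or this is a duplicate delivery).
    bool register_vote(const std::string& from, bool granted) {
        if (_voters.count(from) == 0) {
            return false;  // non-member
        }
        if (!_responded.insert(from).second) {
            return false;  // duplicate: this member has already voted
        }
        if (granted) {
            ++_granted;
        }
        return true;
    }

    // True once a strict majority of members has granted its vote.
    bool won() const { return _granted > _voters.size() / 2; }
};
```

Without the `_responded` set, a vote message delivered twice by the network could push a candidate over the majority threshold with fewer distinct voters than required.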
Konstantin Osipov	7ea064ac04	raft: remove fsm::set_configuration() Set either tracker or votes configuration explicitly. This saves a few lines and simplifies unit tests.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	4083026b65	raft: consistently use configuration from the log	2021-02-18 16:04:44 +03:00
Konstantin Osipov	c4552ffb9a	raft: add ostream serialization for enum vote_result	2021-02-18 16:04:44 +03:00
Konstantin Osipov	2ae04d8a47	raft: advance commit index right after leaving joint configuration Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}. The leader's view of stable indexes (server: match index) is: A: 5, B: 5, C: 6, D: 7, E: 8. The commit index would be 5 if we use joint configuration, and 6 if we assume we left it. Let it happen without an extra FSM step.	2021-02-18 16:04:44 +03:00
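The arithmetic in the commit above can be checked with a small sketch: compute the highest index replicated on a majority of each configuration and, while in joint mode, take the smaller of the two. The names `quorum_index` and `joint_commit_index` are hypothetical and are not taken from the patch; server ids are strings and a missing match index is assumed to be 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

using index_t = std::uint64_t;

// Highest log index that is replicated on a majority of `members`.
index_t quorum_index(const std::vector<std::string>& members,
                     const std::map<std::string, index_t>& match) {
    assert(!members.empty());
    std::vector<index_t> idx;
    for (const auto& m : members) {
        auto it = match.find(m);
        idx.push_back(it == match.end() ? 0 : it->second);
    }
    // Sort descending: the element at position size/2 is reached by
    // size/2 + 1 members, which is a strict majority.
    std::sort(idx.begin(), idx.end(), std::greater<index_t>());
    return idx[idx.size() / 2];
}

// In a joint configuration an entry is committed only when a majority of
// BOTH configurations has it, so take the smaller of the two quorum indexes.
index_t joint_commit_index(const std::vector<std::string>& previous,
                           const std::vector<std::string>& current,
                           const std::map<std::string, index_t>& match) {
    return std::min(quorum_index(previous, match), quorum_index(current, match));
}
```

For the match indexes in the commit message, the joint result is limited by {A, B} at 5, while the final configuration alone yields 6, which is why the commit index can advance as soon as the joint configuration is left.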
Konstantin Osipov	6e3932bbc7	raft: tidy up follower_progress API Make the API more explicit so it's available for testing.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	ed65a8635e	raft: update raft::log::apply_snapshot() assert apply_snapshot() doesn't support applying the same snapshot twice. The caller must check the current snapshot before applying.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	51c968bcb4	raft: rename log::non_snapshoted_length() to log::in_memory_size() The old name was incorrect, in case apply_snapshot() was called with non-zero trailing entries, the total log length is greater than the length of the part that is not stored in a snapshot. Fix spelling in related comments. Rename fsm::wait() to fsm::wait_max_log_size(), it's a more specific name. Rename max_log_length to max_log_size to use 'size' rather than 'length' consistently for log size.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	cfe407b402	raft: inline raft::log::truncate_tail() It's the core of apply_snapshot() work and is only used in it. Now that truncate_tail is inline, rename truncate_head() to truncate_uncommitted().	2021-02-18 16:04:44 +03:00
Konstantin Osipov	e0011c6e4d	raft: ignore AppendEntries RPC with a very old term Do not assert on an outdated message.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	805d52eb16	raft: remove log::start_idx() Replace it with a private _first_idx, which is maintained along with the rest of class log state. _first_idx is a name consistent with counterpart last_idx(). Do not use a function since going forward we may want to remove Raft index from struct log_entry, so should rely less on it. This fixes a bug when _last_conf_idx was not reset after apply_snapshot() because start_idx() was pointing to a non-existent entry.	2021-02-18 16:04:44 +03:00
Konstantin Osipov	af8770da63	raft: return a correct last term on an empty log If the log is empty, we must use snapshot's term, since the log could be right after taking a snapshot when no trailing entries were kept. This fixes a rare possible bug when a log matching rule could be violated during elections by a follower with a log which was just truncated after a snapshot. A separate unit test for the issue will follow.	2021-02-18 16:04:43 +03:00
Konstantin Osipov	cb035a7c8d	raft: do not use raft::log::start_idx() outside raft::log() raft::log::start_idx() is currently not meaningful in case the log is empty. Avoid using it in fsm::replicate_to() and avoid manual search for previous log term, instead encapsulate the search in log::term_for(). As a side effect we currently return a correct term (0) when log matching rule is exercised for an empty log and the very first snapshot with term 0. Update raft_etcd_test.cc accordingly. This change happens to reduce the overall line count. While at it, improve the comments in raft::replicate_to().	2021-02-18 16:04:43 +03:00
Konstantin Osipov	04b4d97d6a	raft: rename progress.hh to tracker.hh class tracker is the main class of this module.	2021-02-18 16:04:43 +03:00
Alejo Sanchez	b41a6822e8	raft: drop ticker from raft Remove ticker callbacks from raft::server. External code should periodically call raft::server::tick(). Update replication_test accordingly. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-14 09:41:42 -04:00
Alejo Sanchez	97338ab53f	raft: replication test: fix debug mode hangs In certain situations where barely enough nodes to elect a new leader are connected, a disruptive candidate can occasionally block the election. For example, take servers A B C D E where only A B C are active in a partition. If the test wants to elect A, it has to first make all 3 servers reach election timeout threshold (to make B and C receptive). Then A is ticked till it becomes a candidate and has to send vote requests to the other servers. But all servers have a timer (_ticker) calling their periodic tick() functions. If one of the other servers, say B, gets its timer tick before A sends vote requests, B becomes a (disruptive) candidate and will refuse to vote for A. In our case of only having 3 out of 5 servers connected a single missing vote can hang the election. This patch disables timer ticks for all servers when running custom elections and partitioning. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-02-11 11:42:31 -04:00
Konstantin Osipov	adc87aa278	raft: re-lookup progress object after a configuration change Fix raft_fsm_test failure in debug mode. ASAN complained that follower_progress is used in append_entries_reply() after it was destroyed. This could happen if in maybe_commit() we switched to a new configuration and destroyed old progress objects. The fix is to lookup the object one more time after maybe_commit().	2021-02-05 12:40:19 +01:00
Gleb Natapov	e9043565b3	raft: add counters to raft server The patch adds a set of counters for various events inside the raft implementation to facilitate monitoring and debugging. Message-Id: <20210204125313.GA1513786@scylladb.com>	2021-02-04 14:19:54 +01:00
Konstantin Osipov	a8f2fa7fa0	raft: update README.md	2021-01-29 22:07:08 +03:00
Konstantin Osipov	c7b5a60320	raft: joint consensus, wire up configuration changes in the API Now that we've implemented joint consensus based configuration changes, replace add_server()/remove_server() with a more general set_configuration().	2021-01-29 22:07:08 +03:00
Konstantin Osipov	afadc7c0a1	raft: joint consensus, count votes using joint config Send RequestVote to a joint config. We need to exclude self from the list of peers if we're not part of the current configuration. Avoid disrupting the cluster in this case. Maintain separate status for previous and current config when counting votes.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	8b86d91754	raft: joint consensus, wire up configuration changes in FSM When add_entry() with a new configuration is submitted, create a joint configuration and switch to it immediately. Refuse to enter joint configuration if a configuration change is already in progress. When the leader has committed an entry with joint configuration, append a new entry with final configuration and switch to it. Resign leadership if the current leader is not part of a new configuration. When we change from A, B, C to B, C, D and the leader is A, then, when C_new starts to be used, the leader is not part of the current configuration, so it doesn't have to be in the tracker. Do not try to find & advance leader progress unconditionally then.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	18a684ba11	raft: joint consensus, update progress tracker with joint configuration The leader doesn't have to be part of the current configuration, so add a way to access follower_progress for the leader only if it is present. Upon configuration changes, preserve progress information for intact nodes, remove for removed, and create a new progress object for added nodes. When tracking commit progress in joint configuration mode, calculate two commit indexes for two configurations, and choose the smallest one.	2021-01-29 22:07:08 +03:00
Konstantin Osipov	20df1955b2	raft: joint consensus, don't store configuration in FSM In follower state, FSM doesn't know the current cluster configuration. Instead of trying to watch the follower log for configuration changes to keep FSM copy up to date, remove it from FSM altogether since the follower doesn't need it anyway. When entering candidate or leader state, fetch the most recent configuration from the log and initialize the state specific state with it.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	b29181875c	raft: joint consensus, keep track of the last confchange index in the log When initializing the log, find the most recent configuration change index, if present. Maintain the most recent configuration change index when the log is truncated or entries are appended to it. The last configuration change index will be used by FSM when it enters candidate or leader state to fetch the current configuration. We never truncate beyond a single in-progress configuration change, so storing the previous value of last_conf_idx helps avoid log backward scan on truncation in 100% of cases. Remove all unused log constructors.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	6e128aa357	raft: joint consensus, implement helpers in class configuration	2021-01-29 22:07:07 +03:00
Konstantin Osipov	1ca738d9a2	raft: joint consensus, use unordered_set for server_address list	2021-01-29 22:07:07 +03:00
Konstantin Osipov	df944f953c	raft: joint consensus, switch configuration to joint In order to work correctly in transitional configuration, participants must enter it after crashes, restarts and state changes. This means it must be stored in Raft log and snapshot on the leader and followers. This is most easily done if transitional configuration is just a flavour of standard configuration. In FSM, rename _current_config to _configuration, it now contains both current and future configuration at all times.	2021-01-29 22:07:07 +03:00
Konstantin Osipov	076e46af9e	raft: rename check_committed() to maybe_commit() This is what the function does, and it's the name used in other implementations.	2021-01-29 22:07:07 +03:00
Gleb Natapov	aad0209b1c	raft: fix spelling and add comments Fix spelling errors in a few comments, improve comments. With fix-ups by Gleb Natapov <gleb@scylladb.com>	2021-01-29 22:07:07 +03:00
Pavel Solodovnikov	e1504bbf0e	raft: add IDL definitions for raft types Changes to the `configuration` and `tagged_uint64` classes are needed to overcome limitations of the IDL compiler tool, i.e. we need to supply a constructor to the struct initializing all the members (raft::configuration) and also need to make an accessor function for private members (in case of raft::tagged_uint64). All other structs mirror raft definitions in exactly the same way they are declared in `raft.hh`. `tagged_id` and `tagged_uint64` are used directly instead of their typedef-ed companions defined in `raft.hh` since we don't want to introduce indirect dependencies. In such case it can be guaranteed that no accidental changes made outside of the idl file will affect idl definitions. This patch also fixes a minor typo in `snapshot_id_tag` struct used in `snapshot_id` typedef.	2021-01-29 01:59:10 +03:00
Alejo Sanchez	f875ff72c9	raft: testing: remove election wait time and just yield Replace sleep time for elect_me_leader with yield to speed things up. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-01-24 20:25:48 -04:00
Tomasz Grabiec	f08a3e3fd8	Merge "raft: test fixes, etcd tests, simplification" from Alejo This patch set adds etcd unit tests for raft. It also includes a fix for replication test in debug mode and a simplification for append_request. Tests: unit ({dev}), unit ({debug}), unit ({release}) * https://github.com/alecco/scylla/tree/raft-ale-tests-09b: raft: etcd unit tests: test log replication raft: boost test etcd: test fsm can vote from any state raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs raft: replication test: add etcd test for cycling leaders raft: testing: provide primitives to wait for log propagation raft: etcd unit tests: initial boost tests raft: combine append_request _receive and _send	2021-01-21 10:41:33 +02:00

88 Commits