scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Gleb Natapov	aa7ea333da	raft: document that add entry may throw commit_status_unknown	2021-05-06 11:59:36 +03:00
Gleb Natapov	d2f58d8656	raft: drop waiters with outdated terms Currently an entry is declared to be dropped only when an entry with a different term is committed with the same index, but that may create a situation where, if no new entries are submitted for a long time, an already dropped entry will not be noticed for a long time as well. Consider the case where a client submits 10 entries on a leader A, but before they get replicated the leadership moves to a node B. B will commit a dummy entry which will be committed eventually and will release one of the waiters on A, but if nothing else is submitted to B, the 9 other waiters will wait forever. The way to solve that is to drop all waiters that wait for a term smaller than the one being committed. There is no chance they will be committed any longer since terms in the log may only grow.	2021-05-06 11:34:31 +03:00
Gleb Natapov	6abe2772dc	raft: make snapshot transfer abortable A snapshot transfer may take a lot of time and meanwhile a leader doing it may lose the leadership. If that happens the ongoing snapshot transfer becomes obsolete since the snapshot will be rejected by the receiving node as coming from an old leader. Make snapshot transfer abortable and abort them when leader changes.	2021-05-06 11:34:31 +03:00
Gleb Natapov	50d545a138	raft: accept snapshots transfer from multiple nodes simultaneously A leader may change while one of its followers is in snapshot transfer mode and that node may get an additional request for snapshot transfer from a new leader while the previous transfer is still not aborted. Currently such a situation will trigger an assert. This patch allows active snapshot transfers from multiple nodes, but only one of them will succeed in the end; all others will be replied to with 'fail'.	2021-05-06 11:34:31 +03:00
Gleb Natapov	073a9be4c7	raft: do not send probes while transferring snapshot If a follower is in snapshot transfer mode there is no need to send probe append messages to it.	2021-05-06 11:34:31 +03:00
Gleb Natapov	08077a21b7	raft: handle messages sending errors Failure to send a message should not abort the raft server.	2021-05-06 11:34:31 +03:00
Gleb Natapov	c4d87d7a23	raft: fix a typo in a variable name	2021-05-06 11:33:47 +03:00
Alejo Sanchez	0a5c605713	raft: replication test: fix custom election Use the new specific connectivity to manage old leader disconnection more specifically. This fixes having elections where the vote of the old leader is required for quorum. For example {A,B} and we want to switch leader. For B to become candidate it has to see A as down. Then A has to see B's request for vote, and vote for B. So to handle the general case the old leader needs to be first disconnected from all nodes, then the desired node is made candidate, then the old leader is connected only to the desired candidate (else, other nodes would see the new candidate as disrupting a live leader). Also, there might be stray messages from the former leader. These could revert the candidate to follower. To handle this, the patch retries the process until the desired node becomes leader. The helper function elect_me_leader() is split and renamed to wait_until_candidate() and wait_election_done(). The former ticks until the node is a candidate and the latter waits until a candidate either becomes a leader or reverts to follower. The existing etcd test workaround of incrementing from n=2 to n=3 nodes is corrected back to original n=2. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Pavel Solodovnikov	4c351ff260	raft: switch `group_id` type from `uint64_t` to `utils::UUID` Introduce a tagged id struct for `group_id`. Raft code would want to generate quite a lot of unique raft groups in the future (e.g. tablets). UUID is designed exactly for that (e.g. larger capacity than `uint64_t`, obviously, and also has built-in procedures to generate random ids). Also, this is a preparation to make "raft group 0" use a random ID instead of a literal fixed `0` as a group id. The purpose is that every scylla cluster must have a unique ID for "raft group 0" since we don't want the nodes from some other cluster to disrupt the current cluster. This can happen if, for some reason, a foreign node happens to contact a node in our cluster. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>	2021-05-02 16:39:54 +03:00
Kamil Braun	4c95277619	raft: fsm: fix assertion failure on stray rejects When probes are sent over a slow network, the leader would send multiple probes to a lagging follower before it would get a reject response to the first probe back. After getting a reject, the leader will be able to correctly position `next_idx` for that follower and switch to pipeline mode. Then, an out of order reject to a now irrelevant probe could crash the leader, since it would effectively request it to "rewind" its `match_idx` for that follower, and the code asserts this never happens. We fix the problem by strengthening `is_stray_reject`. The check that was previously only made in `PIPELINE` case (`rejected.non_matching_idx <= match_idx`) is now always performed and we add a new check: `rejected.last_idx < match_idx`. We also strengthen the assert. The commit improves the documentation by explaining that `is_stray_reject` may return false negatives. We also precisely state the preconditions and postconditions of `is_stray_reject`, give a more precise definition of `progress.match_idx`, argue how the postconditions of `is_stray_reject` follow from its preconditions and Raft invariants, and argue why the (strengthened) assert must always pass. Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>	2021-04-27 01:07:22 +02:00
Pavel Solodovnikov	fba1910770	raft: fix incorrect rpc setup in `server_impl::start()` RPC configuration was updated only when an instance was started with an initial snapshot. In case we don't have an initial snapshot, but do have a non-empty log with a configuration entry, the RPC instance isn't set up correctly. Fix that by moving RPC setup code outside the check for snapshot id and look at `_log.get_configuration()` instead. Also, set up RPC mappings both for `current` and `previous` components, since in case the last configuration index points to an entry from the log, it can happen to be a joint configuration entry. For example, this can happen if a leader made an attempt to change configuration, but failed shortly afterwards without being able to commit the new configuration. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com> Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>	2021-04-26 20:46:50 +02:00
Kamil Braun	8e9a9f8bd3	raft: fsm: include config entries in output.committed Otherwise waiters on committed configuration changes (e.g. `server::set_configuration`) would never get notified. Also if we tried to send another entry concurrently we would get replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed. (not sure if this commit also fixes whatever caused that). Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>	2021-04-22 15:38:10 +02:00
Avi Kivity	14a4173f50	treewide: make headers self-sufficient In preparation for some large header changes, fix up any headers that aren't self-sufficient by adding needed includes or forward declarations.	2021-04-20 21:23:00 +03:00
Gleb Natapov	9fdb3d3d98	raft: stop using seastar::pipe to pass log entries to apply_fiber Stop using seastar::pipe and use seastar::queue directly to pass log entries to apply_fiber. The pipe is a layer above queue anyway and it adds functionality that we do not need (EOS) and hides functionality that we do (being able to abort()). This fixes a crash during abort where the pipe was used after being destroyed. Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>	2021-04-12 13:18:03 +02:00
Gleb Natapov	fb938a36d4	raft: disallow adding and creating servers with id zero Id zero has special meaning in the code and cannot be valid server id. Message-Id: <20210407134853.1964226-1-gleb@scylladb.com>	2021-04-08 17:07:18 +02:00
Gleb Natapov	b3cb4f3966	raft: fix quorum check code for joint config and non-voting members Current leader code checks for most nodes to be alive, but this is incorrect since some nodes may be non-voting and hence should not cause a leader to step down if dead. It is also incorrect with joint config since quorum is calculated differently there. Fix it by introducing activity_tracker class that knows how to handle all the above details.	2021-04-07 10:15:33 +03:00
Gleb Natapov	a48a2c454b	raft: do not hang on waiting for entries on a leader that was removed from a cluster If a leader is removed from a cluster it will never know when entries that it has not committed yet will be committed, so abort the wait in this case with an uncertainty error.	2021-04-07 10:15:33 +03:00
Gleb Natapov	db03c94692	raft: add more tracing to stepdown code	2021-04-07 10:15:33 +03:00
Gleb Natapov	7dec56721c	raft: use existing election_elapsed() function instead of redoing the calculation	2021-04-07 10:15:33 +03:00
Gleb Natapov	3bcd3212e2	raft: check that a node is still the leader after initiating stepdown process Usually initiation of stepdown process does not immediately depose the current leader, but if the current leader is no longer part of the cluster it will happen. We were missing the check after initiating stepdown process in append reply handling.	2021-04-07 10:15:33 +03:00
Gleb Natapov	28add88a1f	raft: do not assert when receiving unexpected messages in a leader state Current code asserts when it gets InstallSnapshot/AppendRequest in a leader state and the term in the message is equal to the current term. It is true that such messages cannot be received if the protocol works correctly, but we should not crash on a network input nonetheless.	2021-04-04 11:33:35 +03:00
Gleb Natapov	995cd1c8a7	raft: use existing function to check if election timeout elapsed is_past_election_timeout() repeats the calculation that election_elapsed() is doing. Use existing function instead.	2021-04-04 11:33:35 +03:00
Gleb Natapov	13a3cf62bb	raft: move incoming message processing into per state functions Clean up step() function by moving state specific processing into per state functions. This way it is easier to see how each state handles individual messages. No functional changes here. Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>	2021-03-29 15:48:43 +02:00
Pavel Solodovnikov	2d9e94f050	raft: update README.md with info on RPC server address mappings Describe the high-level scheme of managing RPC mappings and also expand on the introduction of "expirable" RPC mappings concept and why these are needed. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:13 +03:00
Pavel Solodovnikov	f61206e483	raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes. I.e. the nodes which are not the part of the most recent configuration anymore. The effective scope of RPC mappings is limited by the piece of code which sends messages to both the "new" nodes (which are added to the cluster with the most recent configuration change) and the "old" nodes which are removed from the cluster. Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:09 +03:00
Pavel Solodovnikov	16d9e8e9af	raft/fsm: add optional `rpc_configuration` field to fsm_output The field is set in `fsm.get_output` whenever `_log.last_conf_idx()` or the term changes. Also, add `_last_conf_idx` and `_last_term` to `fsm::last_observed_state`, they are utilized in the condition to evaluate current rpc configuration in `fsm.get_output()`. This will be used later to update rpc config state stored in `server_impl` and maintain rpc address map. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:05 +03:00
Pavel Solodovnikov	19cc85b3b6	raft: maintain current rpc context in `server_impl` Introduce rpc server_address that represents the last observed state of address mappings for RPC module. It does not correspond to any kind of configuration in the raft sense, just an artificial construct corresponding to the largest set of server addresses coming from both previous and current raft configurations (to be able to contact both joining and leaving servers). This will be used later to update rpc module mappings when cluster configuration changes. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	8799ccbab0	raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff` `std::unordered_set::contains` is introduced in C++20 and provides clearer semantics to check existence of a given element in a set. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Alejo Sanchez	7a6616f1cb	raft: testing: expose log for test verification Let derived classes access the log to verify its contents. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:03:46 -04:00
Alejo Sanchez	7e6807e8fc	raft: testing: make become_follower() available for tests Some etcd tests need to force a follower with a specific leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-24 19:11:09 -04:00
Konstantin Osipov	1a1d7ab662	raft: (testing) stray replies from removed followers	2021-03-24 14:05:55 +03:00
Konstantin Osipov	0295163f6f	raft: always return a non-zero configuration index from the log Return snapshot index for last configuration index if there is no configuration in the log.	2021-03-24 14:05:55 +03:00
Konstantin Osipov	00d7379bc9	raft: minor style changes & comments Add comments explaining the rationale from transfer_leadership() (more PhD quotes), encapsulate stable leader check in tick() into a lambda and add more detailed comments to it.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	ce29fb44c3	raft: do not assert when transitioning to empty config Throw instead, to make this case testable.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	2ee15ad6c7	raft: assert we never apply a snapshot over uncommitted entries (leader)	2021-03-22 18:55:40 +03:00
Konstantin Osipov	c7f7ad2c4e	raft: improve tracing Add tracing to apply_snapshot, request_vote.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	4dd66edae5	raft: add fsm_output::empty() helper to aid testing Used in testing to implement trivial transport.	2021-03-22 18:55:40 +03:00
Konstantin Osipov	89349f550c	raft: aid testing by providing fsm::id()	2021-03-22 18:55:40 +03:00
Gleb Natapov	9d6bf7f351	raft: introduce leader stepdown procedure Section 3.10 of the PhD describes two cases for which the extension can be helpful: 1. Sometimes the leader must step down. For example, it may need to reboot for maintenance, or it may be removed from the cluster. When it steps down, the cluster will be idle for an election timeout until another server times out and wins an election. This brief unavailability can be avoided by having the leader transfer its leadership to another server before it steps down. 2. In some cases, one or more servers may be more suitable to lead the cluster than others. For example, a server with high load would not make a good leader, or in a WAN deployment, servers in a primary datacenter may be preferred in order to minimize the latency between clients and the leader. Other consensus algorithms may be able to accommodate these preferences during leader election, but Raft needs a server with a sufficiently up-to-date log to become leader, which might not be the most preferred one. Instead, a leader in Raft can periodically check to see whether one of its available followers would be more suitable, and if so, transfer its leadership to that server. (If only human leaders were so graceful.) The patch here implements the extension and employs it automatically when a leader removes itself from a cluster.	2021-03-22 10:28:43 +02:00
Gleb Natapov	888b52dea1	raft: fix replication when leader is not part of current config When a leader orchestrates its own removal from a cluster there is a situation where the leader is still responsible for replication, but it is no longer part of active configuration. Current code skips replication in this case though. Fix it by always replicating in the leader state.	2021-03-22 09:52:17 +02:00
Gleb Natapov	1acc8996bc	raft: do not update last election time if current leader is not a part of current configuration Since we use an external failure detector instead of relying on empty AppendRequests from a leader, there can be a situation where a node is no longer part of a certain raft group but is still alive (and also may be part of other raft groups). In such a case last election time should not be updated even if the node is alive. It is the same as if it had stopped sending empty AppendRequests in original raft.	2021-03-22 09:52:17 +02:00
Gleb Natapov	ccf4435759	raft: move log limiting semaphore into the leader state Log limiting semaphore is used on a leader only, so it should be stored inside the leader state.	2021-03-22 09:52:17 +02:00
Konstantin Osipov	fcc6e621f8	raft: pass snapshot_reply into fsm::step() By the time we receive snapshot_reply from a follower we may no longer be the leader. Follower term may be different from snapshot term, e.g. the follower may be aware of a new leader already and have a higher term. We should pass this information into (possibly ex-) leader FSM via fsm::step() so that it can correctly change its state, and not call FSM directly.	2021-03-18 16:56:46 +03:00
Konstantin Osipov	4afa662d62	raft: respond with snapshot_reply to send_snapshot RPC Raft send_snapshot RPC is actually two-way, the follower responds with snapshot_reply message. This message until now was, however, muted by RPC. Do not mute snapshot_reply any more: - to make it obvious the RPC is two way - to feed the follower response directly into leader's FSM and thus ensure that FSM testing results produced when using a test transport are representative of the real world uses of raft::rpc.	2021-03-18 16:56:42 +03:00
Konstantin Osipov	cb3314d756	raft: set follower's next_idx when switching to SNAPSHOT mode Set follower's next_idx to snapshot index + 1 when switching it to snapshot mode. If snapshot transfer succeeds, that's the best match for the follower's next replication index. If it fails, the leader will send a new probe to find out the follower position again and re-try sending a possibly newer snapshot. The change helps reduce protocol state managed outside FSM.	2021-03-18 16:35:11 +03:00
Konstantin Osipov	66c729da66	raft: set the current leader upon getting InstallSnapshot If the current leader is set, the follower will not vote for another candidate. This is also known as "sticky leadership" rule. Before this change, the rule was enacted only upon receiving AppendEntries RPC from the leader. Turn it on also upon receiving InstallSnapshot RPC.	2021-03-18 08:36:57 +03:00
Gleb Natapov	32d386d0d8	raft: fix use after free during logging in append_entries_reply() As the existing comment explains a progress can be deleted at the point of logging. The logging should only be done if the progress still exists. Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>	2021-03-17 09:59:22 +02:00
Pavel Solodovnikov	93c565a1bf	raft: allow raft server to start with initial term 0 Prior to the fix there was an assert to check in `raft::server_impl::start` that the initial term is not 0. This restriction is completely artificial and can be lifted without any problems, which will be described below. The only place that is dependent on this corner case is in `server_impl::io_fiber`. Whenever term or vote has changed, they will be both set in `fsm::get_output`. `io_fiber` checks whether it needs to persist term and vote by validating that the term field is set (by actually executing a `term != 0` condition). This particular check is based on an unobvious fact that the term will never be 0 in case `fsm::get_output` saves term and vote values, indicating that they need to be persisted. Vote and term can change independently of each other, so that checking only for term obscures what is happening and why even more. In either case term will never be 0, because: 1. If the term has changed, then it's naturally greater than 0, since it's a monotonically increasing value. 2. If the vote has changed, it means that we received a vote request message. In such case we have already updated our term to the requester's term. Switch to using an explicit optional in `fsm_output` so that a reader doesn't have to think about the motivation behind this `if` and can just check that the `term_and_vote` optional is engaged. Given the motivation described above, the corresponding assert(_fsm->get_current_term() != term_t(0)); in `server_impl::start` is removed. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-17 09:59:21 +02:00
Gleb Natapov	e231186a7b	raft: store leader and candidate state in state variant We already have server state dependent state in fsm, so there is no need to maintain "voters" and "tracker" optionals as well. The upside is that optional and variant states cannot drift apart now.	2021-03-12 11:12:57 +02:00
Gleb Natapov	e17e7d57bd	raft: add boost tests for prevoting	2021-03-12 11:12:57 +02:00
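
The rule described in commit d2f58d8656 (drop every waiter whose term is smaller than the term being committed) can be illustrated with a minimal sketch. This is not the code from raft/server.cc: `waiter`, `wait_result`, and `resolve_waiters` are hypothetical names, the types are simplified stand-ins, and the sketch assumes commits are reported one entry at a time.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the types used in the raft code.
using index_t = std::uint64_t;
using term_t = std::uint64_t;

enum class wait_result { committed, dropped };

struct waiter {
    term_t term; // term of the entry the client is waiting on
};

// Called when the entry at (committed_idx, committed_term) is committed.
// Resolves and removes every waiter whose fate is now known and returns
// the (index, result) pairs; waiters that may still commit are left alone.
std::vector<std::pair<index_t, wait_result>>
resolve_waiters(std::map<index_t, waiter>& waiters,
                index_t committed_idx, term_t committed_term) {
    std::vector<std::pair<index_t, wait_result>> resolved;
    for (auto it = waiters.begin(); it != waiters.end();) {
        const index_t idx = it->first;
        const term_t term = it->second.term;
        if (idx == committed_idx && term == committed_term) {
            // The awaited entry itself was committed.
            resolved.emplace_back(idx, wait_result::committed);
            it = waiters.erase(it);
        } else if (term < committed_term) {
            // Terms in the log only grow, so an entry from an older term
            // that is not yet committed can never be committed.
            resolved.emplace_back(idx, wait_result::dropped);
            it = waiters.erase(it);
        } else {
            ++it;
        }
    }
    return resolved;
}
```

In the scenario from the commit message, ten waiters registered in term 1 on the old leader are all released as dropped as soon as the new leader's dummy entry from term 2 is committed, rather than only the one waiter that shares the dummy entry's index.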

126 Commits