scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 12:06:44 +00:00

Author	SHA1	Message	Date
Konstantin Osipov	eaf32f2c3c	raft: (testing) test receiving a confchange in a snapshot	2021-06-11 17:16:56 +03:00
Konstantin Osipov	ba046ed1ab	raft: style fix	2021-06-11 17:16:54 +03:00
Konstantin Osipov	b0a1ebc635	raft: step down as a leader if converted to a non-voter If the leader becomes a non-voter after a configuration change, step down and become a follower. Non-voting members are an extension to Raft, so the protocol spec does not define whether they can be leaders. I can not think of a reason why they can't, yet I also can not think of a reason why it's useful, so let's forbid this. We already do not allow non-voters to become candidates, and they ignore timeout_now RPC (leadership transfer), so they already can not be elected.	2021-06-11 17:16:50 +03:00
Konstantin Osipov	684e0d2a8c	raft: improve configuration consistency checks Isolate the checks for configuration transitions in a static function, to be able to unit test outside class server. Split the condition of transitioning to an empty configuration from the condition of transitioning into a configuration with no voters, to produce more user-friendly error messages. Allow to transfer leadership in a configuration when the only voter is the leader itself. This would be equivalent to syncing the leader log with the learner and converting the leader to the follower itself. This is safe, since the leader will re-elect itself quickly after an election timeout, and may be used to do a rolling restart of a cluster with only one voter. A test case follows.	2021-06-11 17:16:47 +03:00
Alejo Sanchez	add12d801d	raft: log ignored prevote Add a log line for ignored prevote. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210609193945.910592-2-alejo.sanchez@scylladb.com>	2021-06-10 12:33:34 +02:00
Tomasz Grabiec	ce7a404f17	Merge "Cleanups/refactoring for Raft Group 0" from Kostja * scylla-dev/raft-group-0-part-1-rebase: raft: (service) pass Raft service into storage_service raft: (service) add comments for boot steps raft: add ordering for raft::server_address based on id raft: (internal) simplify construction of tagged_id raft: (internal) tagged_id minor improvements	2021-06-09 10:48:05 +02:00
Konstantin Osipov	b81580f3c6	raft: add ordering for raft::server_address based on id	2021-06-08 14:52:32 +03:00
Konstantin Osipov	d42d5aee8c	raft: (internal) simplify construction of tagged_id Make it easy to construct tagged_id from UUID.	2021-06-08 14:52:32 +03:00
Konstantin Osipov	c9a23e9b8a	raft: (internal) tagged_id minor improvements Introduce a syntax helper tagged_id::create_random_id(), used to create a new Raft server or group id. Provide a default ordering for tagged ids, for use in Raft leader discovery, which selects the smallest id for leader.	2021-06-08 14:52:32 +03:00
Gleb Natapov	5d15ecb7e5	raft: do not block io_fiber just because of a slow follower Currently, if append_message cannot be sent to one of the followers, the entire io_fiber will block, which eventually stops the replication. The patch changes the message-sending part of io_fiber to be non-blocking. The code adds a hash table that is used to keep track of append_request sending status per destination. All the remaining futures are waited for during abort. Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>	2021-06-07 16:55:14 +02:00
Alejo Sanchez	bd168d57ff	raft: fix vote reply handling in prevote Do not register a reply to prevote as a real vote Found and authored by @kostja. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210604122530.1975388-1-alejo.sanchez@scylladb.com>	2021-06-06 19:18:49 +03:00
Tomasz Grabiec	50d64646cd	Merge "raft: replication test fixes and OOP refactor" from Alejo Feature requests, fixes, and OOP refactor of replication_test. Note: all known bugs and hangs are now fixed. A new helper class "raft_cluster" is created. Each move of a helper function to the class has its own commit. New helpers are provided. To simplify code, for now only a single apply function can be set per raft_cluster. No tests were using it in any other way. In the future, there could be custom apply functions per server dynamically assigned, if this becomes needed. * alejo/raft-tests-replication-02-v3-30: (66 commits) raft: replication test: wait for log for both index and term raft: replication test: reset network at construction raft: replication test: use lambda visitor for updates raft: replication test: move structs into class raft: replication test: move data structures to cluster class raft: replication test: remove shared pointers raft: replication test: move get_states() to raft_cluster raft: replication test: test_server inside raft_cluster raft: replication test: rpc declarative tests raft: replication test: add wait_log raft: replication test: add stop and reset server raft: replication test: disconnect 2 support raft: replication test: explicit node_id naming raft: replication test: move definitions up raft: replication test: no append entries support raft: replication test: fix helper parameter raft: replication test: stop servers out of config raft: replication test: wait log when removing leader from configuration raft: replication test: only manipulate servers in configuration raft: replication test: only cancel rearm ticker for removed server ...	2021-06-06 19:18:49 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Gleb Natapov	bb822c92ab	raft: change raft::rpc api to return void for most sending functions Most RAFT packets are sent very rarely during special phases of the protocol (like election or leader stepdown). The protocol itself does not care if a packet is sent or dropped, so returning futures from their send function does not serve any purpose. Change the raft's rpc interface to return void for all packet types but append_request. We still want to get a future from sending append_request for backpressure purposes since replication protocol is more efficient if there is no packet loss, so it is better to pause a sender than dropping packets inside the rpc. Rpc is still allowed to drop append_requests if overloaded.	2021-06-06 19:18:49 +03:00
Gleb Natapov	f5a54d6c05	raft: move ELECTION_TIMEOUT definition to a public header Move ELECTION_TIMEOUT definition to be visible to outside modules.	2021-06-06 19:18:49 +03:00
Gleb Natapov	87844c0ce1	raft: remove unused clock type definition RAFT uses logical clock now and this define is from older times.	2021-06-06 19:18:49 +03:00
Gleb Natapov	90ea71da54	raft: wait for io and applier fiber to stop before aborting snapshots and waiters IO and applier fibers may update waiters and start new snapshot transfers, so abort() needs to wait for them to stop before proceeding to abort waiters and snapshot transfers.	2021-06-06 19:18:49 +03:00
Alejo Sanchez	3e91a8ca0d	raft: replication test: wait for log for both index and term Waiting on index alone does not guarantee correct leader log propagation. This patch also checks the term of the leader's last log entry. This was exposed by occasional problems with packet drops. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-06-04 08:38:19 -04:00
Tomasz Grabiec	0fdd2f8217	Merge "raft: fsm cleanups" from Gleb * scylla-dev/raft-cleanup-v1: raft: drop _leader_progress tracking from the tracker raft: move current_leader into the follower state raft: add some precondition checks	2021-05-14 17:24:59 +02:00
Alejo Sanchez	68f69671b5	raft: style: test optionals directly Avoid using has_value() and test optional directly Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>	2021-05-12 20:39:52 +02:00
Gleb Natapov	78c5a72b32	raft: drop _leader_progress tracking from the tracker The tracker maintains a separate pointer to current leader progress, but all this complexity is not needed because the tracker already has a find() function that can either find a leader's progress by id or return null. Removing the tracking simplifies the code and makes going out of sync (which is always a possibility if a state is maintained in two different places) impossible.	2021-05-09 13:55:55 +03:00
Gleb Natapov	1245736776	raft: move current_leader into the follower state Only when fsm is in the follower state current_leader has any meaning. In the leader state a node is always its own follower and in a candidate state there is no leader. To make sure that the current_leader value cannot be out of sync with fsm state move it into the follower state.	2021-05-09 13:55:55 +03:00
Gleb Natapov	0634674aef	raft: add some precondition checks Check that fsm does not process messages from itself and that it does not try to become its own follower.	2021-05-07 08:04:16 +03:00
Gleb Natapov	aa7ea333da	raft: document that add entry may throw commit_status_unknown	2021-05-06 11:59:36 +03:00
Gleb Natapov	d2f58d8656	raft: drop waiters with outdated terms Currently an entry is declared to be dropped only when an entry with a different term is committed with the same index, but that may create a situation where, if no new entries are submitted for a long time, an already dropped entry will not be noticed for a long time as well. Consider the case where a client submits 10 entries on a leader A, but before they get replicated the leadership moves to a node B. B will commit a dummy entry which will be committed eventually and will release one of the waiters on A, but unless something else is submitted to B, the 9 other waiters will wait forever. The way to solve that is to drop all waiters that wait for a term smaller than the one being committed. There is no chance they will be committed any longer since terms in the log may only grow.	2021-05-06 11:34:31 +03:00
Gleb Natapov	6abe2772dc	raft: make snapshot transfer abortable A snapshot transfer may take a lot of time and meanwhile a leader doing it may lose the leadership. If that happens the ongoing snapshot transfer becomes obsolete since the snapshot will be rejected by the receiving node as coming from an old leader. Make snapshot transfer abortable and abort them when leader changes.	2021-05-06 11:34:31 +03:00
Gleb Natapov	50d545a138	raft: accept snapshots transfer from multiple nodes simultaneously A leader may change while one of its followers is in snapshot transfer mode, and that node may get an additional request for snapshot transfer from a new leader while the previous transfer is still not aborted. Currently such a situation will trigger an assert. This patch allows active snapshot transfers from multiple nodes, but only one of them will succeed in the end; all others will be replied to with 'fail'.	2021-05-06 11:34:31 +03:00
Gleb Natapov	073a9be4c7	raft: do not send probes while transferring snapshot If a follower is in snapshot transfer mode there is no need to send probe append messages to it.	2021-05-06 11:34:31 +03:00
Gleb Natapov	08077a21b7	raft: handle messages sending errors Failure to send a message should not abort the raft server.	2021-05-06 11:34:31 +03:00
Gleb Natapov	c4d87d7a23	raft: fix a typo in a variable name	2021-05-06 11:33:47 +03:00
Alejo Sanchez	0a5c605713	raft: replication test: fix custom election Use the new specific connectivity to manage old leader disconnection more specifically. This fixes having elections where the vote of the old leader is required for quorum. For example {A,B} and we want to switch leader. For B to become candidate it has to see A as down. Then A has to see B's request for vote, and vote for B. So to make the general case work, the old leader needs to be first disconnected from all nodes, then the desired node is made candidate, then the old leader is connected only to the desired candidate (else, other nodes would see the new candidate as disrupting a live leader). Also, there might be stray messages from the former leader. These could revert the candidate to follower. To handle this, the patch retries the process until the desired node becomes leader. The helper function elect_me_leader() is split and renamed to wait_until_candidate() and wait_election_done(). The former ticks until the node is a candidate and the latter waits until a candidate either becomes a leader or reverts to follower. The existing etcd test workaround of incrementing from n=2 to n=3 nodes is corrected back to original n=2. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-05-03 07:53:35 -04:00
Pavel Solodovnikov	4c351ff260	raft: switch `group_id` type from `uint64_t` to `utils::UUID` Introduce a tagged id struct for `group_id`. Raft code would want to generate quite a lot of unique raft groups in the future (e.g. tablets). UUID is designed exactly for that (e.g. larger capacity than `uint64_t`, obviously, and also has built-in procedures to generate random ids). Also, this is a preparation to make "raft group 0" use a random ID instead of a literal fixed `0` as a group id. The purpose is that every scylla cluster must have a unique ID for "raft group 0" since we don't want the nodes from some other cluster to disrupt the current cluster. This can happen if, for some reason, a foreign node happens to contact a node in our cluster. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>	2021-05-02 16:39:54 +03:00
Kamil Braun	4c95277619	raft: fsm: fix assertion failure on stray rejects When probes are sent over a slow network, the leader would send multiple probes to a lagging follower before it would get a reject response to the first probe back. After getting a reject, the leader will be able to correctly position `next_idx` for that follower and switch to pipeline mode. Then, an out of order reject to a now irrelevant probe could crash the leader, since it would effectively request it to "rewind" its `match_idx` for that follower, and the code asserts this never happens. We fix the problem by strengthening `is_stray_reject`. The check that was previously only made in `PIPELINE` case (`rejected.non_matching_idx <= match_idx`) is now always performed and we add a new check: `rejected.last_idx < match_idx`. We also strengthen the assert. The commit improves the documentation by explaining that `is_stray_reject` may return false negatives. We also precisely state the preconditions and postconditions of `is_stray_reject`, give a more precise definition of `progress.match_idx`, argue how the postconditions of `is_stray_reject` follow from its preconditions and Raft invariants, and argue why the (strengthened) assert must always pass. Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>	2021-04-27 01:07:22 +02:00
Pavel Solodovnikov	fba1910770	raft: fix incorrect rpc setup in `server_impl::start()` RPC configuration was updated only when an instance was started with an initial snapshot. In case we don't have an initial snapshot, but do have a non-empty log with a configuration entry, the RPC instance isn't set up correctly. Fix that by moving RPC setup code outside the check for snapshot id and look at `_log.get_configuration()` instead. Also, set up RPC mappings both for `current` and `previous` components, since in case the last configuration index points to an entry from the log, it can happen to be a joint configuration entry. For example, this can happen if a leader made an attempt to change configuration, but failed shortly afterwards without being able to commit the new configuration. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com> Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>	2021-04-26 20:46:50 +02:00
Kamil Braun	8e9a9f8bd3	raft: fsm: include config entries in output.committed Otherwise waiters on committed configuration changes (e.g. `server::set_configuration`) would never get notified. Also if we tried to send another entry concurrently we would get replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed. (not sure if this commit also fixes whatever caused that). Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>	2021-04-22 15:38:10 +02:00
Avi Kivity	14a4173f50	treewide: make headers self-sufficient In preparation for some large header changes, fix up any headers that aren't self-sufficient by adding needed includes or forward declarations.	2021-04-20 21:23:00 +03:00
Gleb Natapov	9fdb3d3d98	raft: stop using seastar::pipe to pass log entries to apply_fiber Stop using seastar::pipe and use seastar::queue directly to pass log entries to apply_fiber. The pipe is a layer above queue anyway and it adds functionality that we do not need (EOS) and hides functionality that we do (being able to abort()). This fixes a crash during abort where the pipe was used after being destroyed. Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>	2021-04-12 13:18:03 +02:00
Gleb Natapov	fb938a36d4	raft: disallow adding and creating servers with id zero Id zero has special meaning in the code and cannot be valid server id. Message-Id: <20210407134853.1964226-1-gleb@scylladb.com>	2021-04-08 17:07:18 +02:00
Gleb Natapov	b3cb4f3966	raft: fix quorum check code for joint config and non-voting members Current leader code checks for most nodes to be alive, but this is incorrect since some nodes may be non-voting and hence should not cause a leader to step down if dead. It is also incorrect with joint config since quorum is calculated differently there. Fix it by introducing an activity_tracker class that knows how to handle all the above details.	2021-04-07 10:15:33 +03:00
Gleb Natapov	a48a2c454b	raft: do not hang on waiting for entries on a leader that was removed from a cluster If a leader is removed from a cluster it will never know when entries that it has not committed yet will be committed, so abort the wait in this case with an uncertainty error.	2021-04-07 10:15:33 +03:00
Gleb Natapov	db03c94692	raft: add more tracing to stepdown code	2021-04-07 10:15:33 +03:00
Gleb Natapov	7dec56721c	raft: use existing election_elapsed() function instead of redoing the calculation	2021-04-07 10:15:33 +03:00
Gleb Natapov	3bcd3212e2	raft: check that a node is still the leader after initiating stepdown process Usually initiation of stepdown process does not immediately depose the current leader, but if the current leader is no longer part of the cluster it will happen. We were missing the check after initiating stepdown process in append reply handling.	2021-04-07 10:15:33 +03:00
Gleb Natapov	28add88a1f	raft: do not assert when receiving unexpected messages in a leader state Current code asserts when it gets InstallSnapshot/AppendRequest in a leader state and the term in the message is equal to the current term. It is true that such messages cannot be received if the protocol works correctly, but we should not crash on a network input nonetheless.	2021-04-04 11:33:35 +03:00
Gleb Natapov	995cd1c8a7	raft: use existing function to check if election timeout elapsed is_past_election_timeout() repeats the calculation that election_elapsed() is doing. Use existing function instead.	2021-04-04 11:33:35 +03:00
Gleb Natapov	13a3cf62bb	raft: move incoming message processing into per state functions Clean up step() function by moving state specific processing into per state functions. This way it is easier to see how each state handles individual messages. No functional changes here. Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>	2021-03-29 15:48:43 +02:00
Pavel Solodovnikov	2d9e94f050	raft: update README.md with info on RPC server address mappings Describe the high-level scheme of managing RPC mappings and also expand on the introduction of "expirable" RPC mappings concept and why these are needed. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:13 +03:00
Pavel Solodovnikov	f61206e483	raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes, i.e. the nodes which are no longer part of the most recent configuration. The effective scope of RPC mappings is limited by the piece of code which sends messages to both the "new" nodes (which are added to the cluster with the most recent configuration change) and the "old" nodes which are removed from the cluster. Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed of. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:09 +03:00
Pavel Solodovnikov	16d9e8e9af	raft/fsm: add optional `rpc_configuration` field to fsm_output The field is set in `fsm.get_output` whenever `_log.last_conf_idx()` or the term changes. Also, add `_last_conf_idx` and `_last_term` to `fsm::last_observed_state`, they are utilized in the condition to evaluate current rpc configuration in `fsm.get_output()`. This will be used later to update rpc config state stored in `server_impl` and maintain rpc address map. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:05 +03:00
Pavel Solodovnikov	19cc85b3b6	raft: maintain current rpc context in `server_impl` Introduce rpc server_address that represents the last observed state of address mappings for RPC module. It does not correspond to any kind of configuration in the raft sense, just an artificial construct corresponding to the largest set of server addresses coming from both previous and current raft configurations (to be able to contact both joining and leaving servers). This will be used later to update rpc module mappings when cluster configuration changes. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00

149 Commits