This patch implements the Raft extension that allows performing
linearisable reads by accessing the local state machine. The extension
is described in section 6.4 of the Raft PhD dissertation. To sum it up:
to perform a read barrier, a follower needs to ask the leader for the
last committed index that it knows about. The leader must make sure
that it is still the leader before answering, by communicating with a
quorum. When the follower gets the index back, it waits for that index
to be applied, and with that the read_barrier invocation completes.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is used by a follower to
ask the leader for a safe index to read at. The first two are used by
the leader to communicate with a quorum.
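The flow can be sketched roughly as follows; all names here
(`leader_state`, `quorum_alive`, `read_barrier_complete`) are
illustrative, not the actual patch API:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative leader state: the last committed index it knows about,
// and whether it has just confirmed its leadership with a quorum.
struct leader_state {
    uint64_t commit_index;
    bool quorum_alive;
};

// execute_read_barrier_on_leader: the leader only answers with its
// commit index after making sure it still leads a quorum (in the real
// patch this is the read_barrier/read_barrier_reply round trip).
bool execute_read_barrier_on_leader(const leader_state& l,
                                    uint64_t& safe_index) {
    if (!l.quorum_alive) {
        return false; // possibly deposed; the follower has to retry
    }
    safe_index = l.commit_index;
    return true;
}

// The follower's barrier completes once its local state machine has
// applied everything up to the index returned by the leader.
bool read_barrier_complete(uint64_t applied_index, uint64_t safe_index) {
    return applied_index >= safe_index;
}
```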
Add a helper to wait until a Raft cluster leader is elected.
It can be used to avoid sleeps when it's necessary to forward
a request to the leader, but the leader is not yet known.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
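A minimal sketch of the lookup, assuming configuration entries are kept
in an index-ordered map and interpreting "before" as strictly before
the given index (`config` here is a stand-in for the real configuration
type):

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using index_t = uint64_t;
using config = std::string; // stand-in for raft::configuration

// Returns the configuration of the last configuration entry strictly
// before `idx`; if there is none in the log, fall back to the
// configuration stored in the snapshot.
config log_last_conf_before(const std::map<index_t, config>& conf_entries,
                            const config& snapshot_cfg, index_t idx) {
    auto it = conf_entries.lower_bound(idx); // first entry at or after idx
    if (it == conf_entries.begin()) {
        return snapshot_cfg;
    }
    return std::prev(it)->second;
}
```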
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
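The guard amounts to a simple comparison; a sketch, assuming the
snapshot message carries the commit index it covers (names are
illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A snapshot at or behind our commit index would reorder command
// application relative to what the local replica has already applied,
// so it must be rejected.
bool should_apply_remote_snapshot(uint64_t snapshot_commit_idx,
                                  uint64_t local_commit_idx) {
    return snapshot_commit_idx > local_commit_idx;
}
```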
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
We want to serialize snapshot application with command application;
otherwise a command may be applied after a snapshot that already
contains the result of its application (this is not necessarily a
problem, since Raft by itself does not guarantee apply-once semantics,
but it is better to prevent it when possible). This also moves all
interactions with the user's state machine into one place.
Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>
Sometimes the ability to force a leader change is needed, for instance
when a node that is currently serving as the leader needs to be brought
down for maintenance. If it is shut down without a leadership transfer,
the cluster will be unavailable for at least the leader election
timeout.
We already have a mechanism to transfer leadership when an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If the node is still the leader after the timeout, the
operation fails.
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from the
C_old configuration to C_new, after the first configuration change
is committed, the second may fail with an exception saying that a
configuration change is in progress. The topology changes layer should
handle this exception; however, this may introduce either unpleasant
delays into cluster assembly (e.g. if we sleep before retrying), or
a busy-wait/thundering-herd situation, where all nodes keep
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
Isolate the checks for configuration transitions in a static function,
so they can be unit tested outside class server.
Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.
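A sketch of the split checks under assumed, simplified types (the real
code validates `raft::configuration` inside class server):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Assumed, simplified server entry: only the voter flag matters here.
struct server_entry {
    bool is_voter;
};

// Two separate checks so that each failure mode gets its own,
// user-friendly message.
void check_new_configuration(const std::vector<server_entry>& c_new) {
    if (c_new.empty()) {
        throw std::invalid_argument("new configuration is empty");
    }
    bool has_voter = false;
    for (const auto& s : c_new) {
        if (s.is_voter) {
            has_voter = true;
            break;
        }
    }
    if (!has_voter) {
        throw std::invalid_argument("new configuration has no voters");
    }
}
```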
*Allow* transferring leadership in a configuration where the only voter
is the leader itself. This is equivalent to syncing the leader's log
with the learner and then converting the leader itself to a follower.
This is safe, since the leader will quickly re-elect itself after an
election timeout, and it can be used to do a rolling restart of a
cluster with only one voter.
A test case follows.
Currently, if an append message cannot be sent to one of the followers,
the entire io_fiber blocks, which eventually stops replication. The
patch makes the message-sending part of io_fiber non-blocking. The
code adds a hash table that keeps track of the append_request sending
status per destination. All the remaining futures are waited for
during abort.
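The bookkeeping can be sketched like this, with `std::async`/`std::future`
standing in for seastar futures and all names hypothetical:

```cpp
#include <cassert>
#include <future>
#include <unordered_map>

using server_id = int;

// Hypothetical sketch: one outstanding send future per destination,
// kept in a hash table so the io_fiber never blocks on a slow peer.
// std::async(deferred) stands in for the real asynchronous RPC send.
struct append_sender {
    std::unordered_map<server_id, std::future<void>> in_flight;

    void send_append(server_id dst, int* delivered) {
        in_flight[dst] = std::async(std::launch::deferred,
                                    [delivered] { ++*delivered; });
    }

    // On abort, all remaining send futures are waited for before
    // tearing the server down.
    void abort() {
        for (auto& e : in_flight) {
            e.second.wait();
        }
        in_flight.clear();
    }
};
```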
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
Feature requests, fixes, and OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided
To simplify the code, for now only a single apply function can be set
per raft_cluster; no tests were using it in any other way. In the
future there could be custom apply functions assigned per server
dynamically, if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
Most Raft packets are sent very rarely, during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from
their send functions does not serve any purpose. Change raft's RPC
interface to return void for all packet types but append_request. We
still want a future from sending append_request for backpressure
purposes, since the replication protocol is more efficient if there is
no packet loss; it is better to pause the sender than to drop packets
inside the RPC. The RPC is still allowed to drop append_requests if
overloaded.
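The resulting interface shape might look like this (simplified
stand-ins; the real code uses seastar futures and the raft message
types):

```cpp
#include <cassert>
#include <future>

struct vote_request {};   // stand-in for the real raft message types
struct append_request {};

struct rpc {
    // Rare protocol messages are fire-and-forget: nothing useful to
    // wait for, and the transport may drop them.
    virtual void send_vote_request(int to, const vote_request&) = 0;
    // append_request keeps returning a future so that an overloaded
    // transport can pause the sender instead of dropping everything.
    virtual std::future<void> send_append_entries(
        int to, const append_request&) = 0;
    virtual ~rpc() = default;
};

// A trivial in-memory implementation counting the sends.
struct counting_rpc : rpc {
    int void_sends = 0;
    int append_sends = 0;
    void send_vote_request(int, const vote_request&) override {
        ++void_sends;
    }
    std::future<void> send_append_entries(
        int, const append_request&) override {
        ++append_sends;
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }
};
```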
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers.
Waiting on the index alone does not guarantee correct propagation of
the leader's log. This patch additionally checks the term of the
leader's last log entry.
This was exposed by occasional problems with packet drops.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Currently an entry is declared dropped only when an entry with a
different term is committed at the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.
Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
append a dummy entry which will be committed eventually and will
release one of the waiters on A, but if nothing else is submitted to B,
the 9 other waiters will wait forever.
The way to solve this is to drop all waiters that wait for a term
smaller than the one being committed. There is no chance they will ever
be committed, since terms in the log may only grow.
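A sketch of the cleanup rule under assumed, simplified structures:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified waiter: the log index waited on and the
// term of the corresponding entry.
struct waiter {
    uint64_t idx;
    uint64_t term;
};

// When an entry with `committed_term` commits, every waiter on a
// smaller term can be released as "dropped": log terms only grow, so
// its entry can never be committed.
std::vector<waiter> drop_stale_waiters(std::vector<waiter>& waiters,
                                       uint64_t committed_term) {
    std::vector<waiter> dropped;
    for (auto it = waiters.begin(); it != waiters.end();) {
        if (it->term < committed_term) {
            dropped.push_back(*it);
            it = waiters.erase(it);
        } else {
            ++it;
        }
    }
    return dropped;
}
```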
A snapshot transfer may take a lot of time, and meanwhile the leader
performing it may lose leadership. If that happens, the ongoing
snapshot transfer becomes obsolete, since the snapshot will be rejected
by the receiving node as coming from an old leader. Make snapshot
transfers abortable and abort them when the leader changes.
A leader may change while one of its followers is in snapshot transfer
mode, and that node may get an additional snapshot transfer request
from the new leader while the previous transfer is still not aborted.
Currently such a situation triggers an assert. This patch allows active
snapshot transfers from multiple nodes, but only one of them will
succeed in the end; all others will be replied to with 'fail'.
Use the new fine-grained connectivity control to manage old leader
disconnection more precisely.
This fixes elections where the vote of the old leader is required
for quorum. For example, with {A,B}, suppose we want to switch the
leader. For B to become a candidate it has to see A as down. Then A
has to see B's request for vote, and vote for B.
So in the general case the old leader needs to be disconnected from
all nodes first, then the desired node is made a candidate, and then
the old leader is connected only to the desired candidate (otherwise,
other nodes would see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, the patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate; the latter waits until a candidate either
becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to the original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
RPC configuration was updated only when an instance was
started with an initial snapshot.
If we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.
Fix that by moving the RPC setup code outside the check for
the snapshot id and looking at `_log.get_configuration()` instead.
Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.
For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
Stop using seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above the queue anyway; it
adds functionality that we do not need (EOS) and hides functionality
that we do need (being able to abort()). This fixes a crash during
abort where the pipe was used after being destroyed.
Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
If a leader is removed from a cluster, it will never learn whether the
entries that it has not yet committed ever become committed, so abort
the wait in this case with an uncertainty error.
The Raft instance needs to update the RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in the configuration, as well as dispose of the old nodes,
i.e. the nodes which are no longer part of the most recent
configuration.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
a majority of the "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be disposed of immediately.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce an rpc server_address that represents the
last observed state of address mappings
for the RPC module.
It does not correspond to any kind of configuration
in the raft sense; it is just an artificial construct
corresponding to the largest set of server
addresses coming from both the previous and current
raft configurations (so that both joining and
leaving servers can be contacted).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
By the time we receive a snapshot_reply from a follower
we may no longer be the leader. The follower's term may be
different from the snapshot term; e.g. the follower may
already be aware of a new leader and have a higher term.
We should pass this information into the (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call the FSM directly.
Raft's send_snapshot RPC is actually two-way: the follower
responds with a snapshot_reply message. Until now, however, this
message was muted by the RPC.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two-way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.
The change helps reduce protocol state managed outside FSM.
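The update itself is small; sketched here with hypothetical names for
the per-follower progress state:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified per-follower progress tracking.
struct follower_progress {
    uint64_t next_idx = 0;
    bool snapshot_mode = false;
};

void switch_to_snapshot(follower_progress& p, uint64_t snapshot_idx) {
    p.snapshot_mode = true;
    // Best guess for the next replication index if the transfer
    // succeeds; on failure the leader probes again and may resend a
    // newer snapshot.
    p.next_idx = snapshot_idx + 1;
}
```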
Prior to this fix there was an assert in
`raft::server_impl::start` checking that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, as described below.
The only place that depends on this corner case is in
`server_impl::io_fiber`. Whenever the term or vote changes,
both are set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist the term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check relies on the unobvious fact that the
term will never be 0 when `fsm::get_output` saves the
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so
checking only the term further obscures what is happening
and why.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In that case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output`, so that
a reader doesn't have to think about the motivation behind this `if`
and can simply check whether the `term_and_vote` optional is engaged.
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
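The shape of the change can be sketched as follows (simplified types;
the real `fsm_output` carries more fields):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

using term_t = uint64_t;
using server_id_t = uint64_t; // stand-in for the voted-for id

// Simplified fsm_output: an engaged optional explicitly says "persist
// term and vote", replacing the implicit `term != 0` convention.
struct fsm_output {
    std::optional<std::pair<term_t, server_id_t>> term_and_vote;
};

bool needs_to_persist_term_and_vote(const fsm_output& out) {
    return out.term_and_vote.has_value();
}
```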
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is how the PhD explains the need for the prevote stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
The prevote stage addresses that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state (in the same term).
In our implementation we have the "stable leader" extension that
prevents a spurious RequestVote from deposing an active leader, but an
AppendEntries with a higher term will still do that, so the prevote
extension is also required.
The old name was incorrect: in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
In certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.
For example, take servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it first has to make all 3
servers reach the election timeout threshold (to make B and C
receptive). Then A is ticked until it becomes a candidate and sends
vote requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
function. If one of the other servers, say B, gets its timer tick
before A sends its vote requests, B becomes a (disruptive) candidate
and will refuse to vote for A. In our case of only 3 out of 5 servers
connected, a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The patch adds a set of counters for various events inside the raft
implementation to facilitate monitoring and debugging.
Message-Id: <20210204125313.GA1513786@scylladb.com>
This patch set adds etcd unit tests for raft.
It also includes a fix for replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
The new naming scheme communicates more clearly to the client of
the raft library that the `persistence` interface implements the
persistence layer of the fsm powering the raft
protocol itself, rather than the client-side workflow and the
user-provided `state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
For tests to be able to transition to a consistent state, in some cases
it's necessary to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode in incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Combine structs for append request send and receive into a single
struct.
Author: Gleb Natapov <gleb@scylladb.com>
Date: Mon Nov 23 14:33:14 2020 +0200