The `wait_for_leader` function would throw a low-level
`abort_requested_exception` from `seastar::shared_promise`.
Translate it to the high-level `raft::request_aborted` so we can reduce
the number of different exception types that cross the Raft API
boundary.
Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
This patch adds the ability to pass an abort_source to Raft request APIs
(add_entry, modify_config) to make them abortable. A request issuer does
not always want to wait for a request to complete, for instance because
a client disconnected, or because it is no longer interested in waiting
due to a timeout. After this patch, waiting for such requests can be
aborted through an abort source. Note that aborting a request only
aborts the wait for it to complete; it does not mean that the request
will not eventually be executed.
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.
There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.
`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.
We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.
Fixes #9981.
Instead of lengthy blurbs, switch to single-line, machine-readable,
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
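For illustration, a dual-licensed file header shrinks from a multi-paragraph blurb to a single identifier line (the license expression below follows the wording in this message):

```cpp
// Before: ~15 lines of AGPL + Apache license boilerplate text.
// After, a single SPDX identifier line:
// SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
```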
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).
Metrics are global, so `register_metrics()` cannot be called on
two servers that have the same ID; having two such servers is useful
e.g. in tests when we want to simulate server stops and restarts.
In the absence of abort_source or timeouts in the Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable the follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.
Optionally disable forwarding in basic_generator_test.
Implement an RPC to forward add_entry calls from a follower
to the leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this could lead to adding
duplicate entries.
The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.
When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts with a local entry we're still
waiting on (same index, but a higher term).
This can happen, e.g., because there was a leader change and the
log was truncated, but we still haven't received the append_entries
RPC from the new leader, still haven't truncated the log locally,
and still haven't aborted all the local waits for truncated entries.
Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right of the
conflicting entry (i.e. with a higher index). However, finding them
would require a linear scan. If we allowed that, we might end up doing
this linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity for this step. At the
same time, as soon as the append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entries will be aborted anyway (see
notify_waiters(), which sets the dropped_entry exception), so the scan
is unnecessary.
Similarly to adding entries, allow modifying the
Raft group configuration on a follower. The implementation
works the same way as adding entries - it forwards the command
to the leader.
Now that add_entry() and modify_config() never throw not_a_leader,
they are more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously this was only possible due to a
semaphore wait timeout, and that scenario was not tested.
Handle timed_out_error at the RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
This patch implements a Raft extension that allows performing
linearizable reads by accessing the local state machine. The extension
is described in section 6.4 of the Raft PhD dissertation. To sum it up:
to perform a read barrier, a follower needs to ask the leader for the
last committed index that it knows about. The leader must make sure
that it is still the leader before answering, by communicating with a
quorum. When the follower gets the index back, it waits for it to be
applied, and with that the read_barrier invocation completes.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is used by a follower
to ask the leader for a safe index it can read at. The first two are
used by the leader to communicate with a quorum.
Sometimes the ability to force a leader change is needed, for instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it is shut down without a leadership
transfer, the cluster will be unavailable for at least the leader
election timeout.
We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If the node is still a leader after the timeout, the
operation fails.
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster, Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from
the C_old configuration to C_new, after the first configuration
is committed the second may fail with an exception saying that a
configuration change is in progress. The topology changes layer should
handle this exception; however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retrying), or
a busy-wait/thundering-herd situation, where all nodes are
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
Feature requests, fixes, and an OOP refactor of replication_test.
Note: all known bugs and hangs are now fixed.
A new helper class `raft_cluster` is created.
Each move of a helper function to the class has its own commit.
New helpers are provided.
To simplify the code, for now only a single apply function can be set
per raft_cluster. No tests were using it in any other way. In the
future, there could be custom apply functions assigned per server
dynamically, if this becomes needed.
* alejo/raft-tests-replication-02-v3-30: (66 commits)
raft: replication test: wait for log for both index and term
raft: replication test: reset network at construction
raft: replication test: use lambda visitor for updates
raft: replication test: move structs into class
raft: replication test: move data structures to cluster class
raft: replication test: remove shared pointers
raft: replication test: move get_states() to raft_cluster
raft: replication test: test_server inside raft_cluster
raft: replication test: rpc declarative tests
raft: replication test: add wait_log
raft: replication test: add stop and reset server
raft: replication test: disconnect 2 support
raft: replication test: explicit node_id naming
raft: replication test: move definitions up
raft: replication test: no append entries support
raft: replication test: fix helper parameter
raft: replication test: stop servers out of config
raft: replication test: wait log when removing leader from configuration
raft: replication test: only manipulate servers in configuration
raft: replication test: only cancel rearm ticker for removed server
...
Waiting on the index alone does not guarantee correct propagation of
the leader's log. This patch also checks the term of the leader's last
log entry.
This was exposed by occasional problems with packet drops.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use the new fine-grained connectivity control to manage the old
leader's disconnection.
This fixes elections where the vote of the old leader is required
for quorum. For example, take {A,B} where we want to switch the leader
to B. For B to become a candidate it has to see A as down. Then A has
to see B's request for vote, and vote for B.
So in the general case the old leader needs to first be disconnected
from all nodes, the desired node made a candidate, and then the old
leader connected only to the desired candidate (otherwise, other nodes
would see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, the patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed into
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate, and the latter waits until the candidate
either becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is reverted back to the original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This is how the Raft PhD dissertation explains the need for the prevote stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
The prevote stage addresses that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state (in the same term).
In our implementation we have the "stable leader" extension, which
prevents a spurious RequestVote from deposing an active leader, but an
AppendEntries with a higher term will still do that, so the prevote
extension is also required.
The old name was incorrect: in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(); it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
In certain situations, where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.
For example, take servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it first has to make all 3
servers reach the election timeout threshold (to make B and C
receptive). Then A is ticked till it becomes a candidate and sends vote
requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only 3 out of 5 servers
connected, a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This patch set adds etcd unit tests for raft.
It also includes a fix for the replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements the
persistence layer of the fsm powering the raft protocol itself,
rather than the client-side workflow and the user-provided
`state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
For tests to transition into a consistent state, in some cases it's
necessary to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode for incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
To prevent the log from taking up too much memory, introduce a
mechanism that limits the log to a certain size. If the size is
reached, no new log entries can be submitted until previous entries
are committed and snapshotted.
This patch allows leaving a snapshot_trailing amount of entries in the
raft log when the state machine is snapshotted and log entries are
dropped. Those entries can be used to catch up slow nodes without
requiring a snapshot transfer. The value is part of the configuration
and can be changed.
The patch implements periodic taking of a snapshot and trimming of
the raft log.
In raft, the only way the log of already committed entries can be
shortened is by taking a snapshot of the state machine and dropping
the log entries included in the snapshot. To keep the log from growing
too large, the patch takes a snapshot periodically after applying N
entries, where N can be configured by setting the snapshot_threshold
value in raft's configuration.
For convenience in writing Raft tests, use declarative structures.
Servers are set up and initialized, and then updates are processed.
For now, updates are just adding entries to the leader and changing the
leader. Updates and leader changes can be specified to run after the
initial test setup.
An example test for 3 nodes: node 0 starts as leader, having two
entries 0 and 1 for term 1, with current term 2; then 12 entries are
added, the leader is changed to node 1, and 12 more entries are added.
The test will automatically add more entries to the last leader until
the test limit of total_values (default 100).
{.name = "test_name", .nodes = 3, .initial_term = 2,
.initial_states = {{.le = {{1,0},{1,1}}},
.updates = {entries{12},new_leader{1},entries{12}},},
The leader is isolated before a change via is_leader() returning false.
The initial leader (default server 0) is set with this method, too.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Send more than one entry in a single append_entry message, but
limit each packet's size according to the append_request_threshold parameter.
Message-Id: <20201007142602.GA2496906@scylladb.com>
This commit introduces the public raft interfaces. raft::server
represents a single raft server instance. raft::state_machine
represents a user-defined state machine. raft::rpc, raft::rpc_client
and raft::storage are used to allow implementing custom networking and
storage layers.
A shared failure detector interface defines keep-alive semantics,
required for an efficient implementation of thousands of raft groups.