scylladb

Author	SHA1	Message	Date
Gleb Natapov	7f26a8eef5	raft: actively search for a leader if it is not known for a tick duration For a follower to forward requests to a leader the leader must be known. But there may be a situation where a follower does not learn about a leader for a while. This may happen when a node becomes a follower while its log is up-to-date and there are no new entries submitted to raft. In such case the leader will send nothing to the follower and the only way to learn about the current leader is to get a message from it. Until a new entry is added to the raft's log a follower that does not know who the leader is will not be able to add entries. Kind of a deadlock. Note that the problem is specific to our implementation where failure detection is done by an outside module. In vanilla raft a leader sends messages to all followers periodically, so essentially it is never idle. The patch solves this by broadcasting specially crafted append reject to all nodes in the cluster on a tick in case a leader is not known. The leader responds to this message with an empty append request which will cause the node to learn about the leader. For optimisation purposes the patch sends the broadcast only in case there is actually an operation that waits for leader to be known. Fixes #10379	2022-04-25 14:51:22 +02:00
Kamil Braun	5308a7d7a3	raft: server: return immediately from `wait_for_leader` if leader is known `wait_for_leader` may be called when leader is known. There's nothing to wait for in this case.	2022-04-25 12:59:55 +02:00
Kamil Braun	ad3141d3e0	raft: server: translate abort_requested_exception to raft::request_aborted The `wait_for_leader` function would throw a low-level `abort_requested_exception` from seastar::shared_promise. Translate it to the high-level raft::request_aborted so we can reduce the number of different exception types which cross the Raft API boundary. Also, add comments on Raft API functions about the exception thrown when requests are aborted.	2022-04-05 19:18:53 +02:00
Kamil Braun	0f0d75fd66	raft: server: translate semaphore_aborted to request_aborted	2022-03-29 15:10:29 +02:00
Gleb Natapov	a1604aa388	raft: make raft requests abortable This patch adds an ability to pass abort_source to raft request APIs (add_entry, modify_config) to make them abortable. A request issuer does not always want to wait for a request to complete, for instance because a client disconnected or because it is no longer interested in waiting because of a timeout. After this patch it can abort waiting for such requests through an abort source. Note that aborting a request only aborts the wait for it to complete; it does not mean that the request will not be eventually executed. Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>	2022-03-16 18:38:01 +01:00
Alejo Sanchez	627275945f	raft: modify_config: support voting state change Handle requests to change voting for servers already present in the current configuration. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 08:00:07 -04:00
Alejo Sanchez	a40417df08	raft: minor: fix log format string Fix format string for log line. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-02-08 08:00:07 -04:00
Tomasz Grabiec	00a9326ae7	Merge "raft: let `modify_config` finish on a follower that removes itself" from Kamil When forwarding a reconfiguration request from follower to a leader in `modify_config`, there is no reason to wait for the follower's commit index to be updated. The only useful information is that the leader committed the configuration change - so `modify_config` should return as soon as we know that. There is a reason not to wait for the follower's commit index to be updated: if the configuration change removes the follower, the follower will never learn about it, so a local waiter will never be resolved. `execute_modify_config` - the part of `modify_config` executed on the leader - is thus modified to finish when the configuration change is fully complete (including the dummy entry appended at the end), and `modify_config` - which does the forwarding - no longer creates a local waiter, but returns as soon as the RPC call to the leader confirms that the entry was committed on the leader. We still return an `entry_id` from `execute_modify_config` but that's just an artifact of the implementation. Fixes #9981. A regression test was also added in randomized_nemesis_test. * kbr/modify-config-finishes-v1: test: raft: randomized_nemesis_test: regression test for #9981 raft: server: don't create local waiter in `modify_config`	2022-01-31 20:14:50 +01:00
Tomasz Grabiec	b78bab7286	Merge "raft: fixes and improvements to the library and nemesis test" from Kamil Raft randomized nemesis test was improved by adding some more chaos: randomizing the network delay, server configuration, ticking speed of servers. This allowed catching a serious bug, which is fixed in the first patch. The patchset also fixes bugs in the test itself and adds quality of life improvements such as better diagnostics when inconsistency is detected. * kbr/nemesis-random-v1: test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency test: raft: randomized_nemesis_test: print details when detecting inconsistency test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine` test: raft: randomized_nemesis_test: keep server id in impure_state_machine test: raft: randomized_nemesis_test: frequent snapshotting configuration test: raft: randomized_nemesis_test: tick servers at different speeds in generator test test: raft: randomized_nemesis_test: simplify ticker test: raft: randomized_nemesis_test: randomize network delay test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()` test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside test: raft: randomized_nemesis_test: fix obsolete comment raft: fsm: print configuration entries appearing in the log raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration` raft: server: abort snapshot applications before waiting for rpc abort raft: server: logging fix raft: fsm: don't advance commit index beyond matched entries	2022-01-31 13:25:27 +01:00
Kamil Braun	28b5792481	raft: server: don't create local waiter in `modify_config` When forwarding a reconfiguration request from follower to a leader in `modify_config`, there is no reason to wait for the follower's commit index to be updated. The only useful information is that the leader committed the configuration change - so `modify_config` should return as soon as we know that. There is a reason not to wait for the follower's commit index to be updated: if the configuration change removes the follower, the follower will never learn about it, so a local waiter will never be resolved. `execute_modify_config` - the part of `modify_config` executed on the leader - is thus modified to finish when the configuration change is fully complete (including the dummy entry appended at the end), and `modify_config` - which does the forwarding - no longer creates a local waiter, but returns as soon as the RPC call to the leader confirms that the entry was committed on the leader. We still return an `entry_id` from `execute_modify_config` but that's just an artifact of the implementation. Fixes #9981.	2022-01-27 17:49:40 +01:00
Kamil Braun	46f6a0cca5	raft: server: abort snapshot applications before waiting for rpc abort The implementation of `rpc` may wait for all snapshot applications to finish before it can finish aborting. This is what the randomized_nemesis_test implementation did. This caused rpc abort to hang in some scenarios. In this commit, the order of abort calls is modified a bit. Instead of waiting for rpc abort to finish and then aborting existing snapshot applications, we call `rpc::abort()` and keep the future, then abort snapshot applications, then wait on the future. Calling `rpc::abort()` first is supposed to prevent new snapshot applications from starting; a comment was added at the interface definition. The nemesis test implementation had this property, and `raft_rpc` in group registry was adjusted appropriately. Aborting the snapshot applications then allows `rpc::abort()` to finish.	2022-01-26 16:06:45 +01:00
Kamil Braun	5577ad6c34	raft: server: logging fix	2022-01-26 15:54:14 +01:00
Gleb Natapov	579dcf187a	raft: allow an option to persist commit index Raft does not need to persist the commit index since a restarted node will either learn it from an append message from a leader or (if entire cluster is restarted and hence there is no leader) new leader will figure it out after contacting a quorum. But some users may want to be able to bring their local state machine to a state as up-to-date as it was before restart as soon as possible without any external communication. For them this patch introduces new persistence API that allows saving and restoring last seen committed index. Message-Id: <YfFD53oS2j1My0p/@scylladb.com>	2022-01-26 14:06:39 +01:00
Gleb Natapov	e56e96ac5a	raft: do not add new wait entries after abort Abort signals stopped_error on all awaited entries, but if an entry is added after this it will be destroyed without signaling and will cause a waiter to get broken_promise. Fixes #9688 Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>	2022-01-25 09:52:30 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Avi Kivity	4118f2d8be	treewide: replace deprecated seastar::later() with seastar::yield() seastar::later() was recently deprecated and replaced with two alternatives: a cheap seastar::yield() and an expensive (but more powerful) seastar::check_for_io_immediately(), that corresponds to the original later(). This patch replaces all later() calls with the weaker yield(). In all cases except one, it's unambiguously correct. In one case (test/perf scheduling_latency_measurer::stop()) it's not so unambiguous, since check_for_io_immediately() will additionally force a poll and so will cause more work to be done (but no additional tasks to be executed). However, I think that any measurement that relies on measuring the work on the last tick is inaccurate enough (you need thousands of ticks to get any amount of confidence in the measurement) that in the end it doesn't matter what we pick. Tests: unit (dev) Closes #9904	2022-01-12 12:19:19 +01:00
Kamil Braun	75bab2beec	raft: server: print the ID of aborted server	2021-12-07 11:23:34 +01:00
Kamil Braun	485c0b1819	raft: server: don't register metrics in `start()` Instead, expose `register_metrics()` at the `server` interface (previously it was a private method of `server_impl`). Metrics are global so `register_metrics()` cannot be called on two servers that have the same ID, which is useful e.g. in tests when we want to simulate server stops and restarts.	2021-12-07 11:23:33 +01:00
Konstantin Osipov	eea82f1262	raft: (server) improve tracing	2021-11-25 12:35:43 +03:00
Konstantin Osipov	0d830d4c11	raft: (metrics) fix spelling of waiters_awaken The usage of awake and awaken is quite messy, but awoken is more common for passive voice, so use waiters_awoken.	2021-11-25 12:35:43 +03:00
Konstantin Osipov	6d28927550	raft: make forwarding optional In the absence of abort_source or timeouts in the Raft API, automatic bouncing can create too much noise during testing, especially during network failures. Add an option to disable the follower bouncing feature, since randomized_nemesis_test has its own bouncing which handles timeouts correctly. Optionally disable forwarding in basic_generator_test.	2021-11-25 12:35:43 +03:00
Konstantin Osipov	e3751068fe	raft: (server) allow adding entries/modify config on a follower Implement an RPC to forward add_entry calls from the follower to leader. Bounce & retry in case of not_a_leader. Do not retry in case of uncertainty - this can lead to adding duplicate entries. The feature is added to core Raft since it's needed by all current clients - both topology and schema changes. When forwarding an entry to a remote leader we may get back a term/index pair that conflicts (has the same index, but with a higher term) with a local entry we're still waiting on. This can happen, e.g. because there was a leader change and the log was truncated, but we still haven't got the append_entries RPC from the new leader, still haven't truncated the log locally, still haven't aborted all the local waits for truncated entries. Only remove the offending entry from the wait list and abort it. There may be entries labeled with an older term to the right (with higher commit index) of the conflicting entry. However, finding them would require a linear scan. If we allow it, we may end up doing this linear scan for every conflicting entry during the transition period, which brings us to N^2 complexity of this step. At the same time, as soon as append_entries that commits a higher-term entry with the same index reaches the follower, the waits for the respective truncated entry will be aborted anyway (see notify_waiters() which sets dropped_entry exception), so the scan is unnecessary. Similarly to being able to add entries, allow modifying Raft group configuration on a follower. The implementation works the same way as adding entries - forwards the command to the leader. Now that add_entry() or modify_config never throws not_a_leader, it's more likely to throw timed_out_error, e.g. in case the network is partitioned. Previously it was only possible due to a semaphore wait timeout, and this scenario was not tested. Handle timed_out_error on RPC level to let the existing tests (specifically the randomized nemesis test) pass.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	9cde1cdf71	raft: (server) implement id() helper There is no easy way to get server id otherwise.	2021-11-25 11:50:38 +03:00
Konstantin Osipov	b9faf41513	raft: (server) remove apply_dummy_entry() It's currently unused, and going forward we'd like to make it work on the follower, which requires a new implementation.	2021-11-25 11:50:38 +03:00
Gleb Natapov	7aac6c2086	raft: rename rpc_configuration to configuration in fsm output The field is generic and used not only for rpc configuration now.	2021-11-09 15:16:57 +02:00
Gleb Natapov	9d505c48de	raft: abort snapshot transfer to a server that was removed from the configuration If a node is removed from a config we should stop transferring snapshot to it. Do exactly that. Fixes #9547	2021-11-09 14:51:40 +02:00
Gleb Natapov	88a6e2446d	raft: fix race between snapshot application and committing of new entries Completion notification code assumes that previous snapshot is applied before new entries are committed, otherwise it asserts that some notifications were missing. But currently commit notifications and snapshot application run in different fibers, so there can be a race between them. Fix that by moving commit notification into applier fiber as well. Fixes #9550	2021-11-09 14:51:40 +02:00
Gleb Natapov	bdf7d1a411	raft: correctly truncate the log in a persistence module during snapshot application When a remote snapshot is applied the log is completely cleared because snapshot transfer happens only when a common log prefix cannot be found, so we cannot be sure that existing entries in the log are correct. But currently this only happens for the in-memory log, by calling apply_snapshot with trailing set to zero; when the persistence module is called to store the snapshot, _config.snapshot_trailing is used, which can be non-zero. This may cause the log to contain incorrect entries after restart. The patch fixes this by using zero trailing for non-local snapshots. Fixes #9551	2021-11-04 15:11:19 +02:00
Avi Kivity	cd4af0c722	raft: disambiguate promise name in raft::active_read gcc complains that the name 'promise' changes meaning (from type to variable) within active_read. Help it by disambiguating the use as type.	2021-10-10 18:16:50 +03:00
Kamil Braun	36f3e26374	raft: server: handle `rpc::send_snapshot` returning instantly If `rpc::send_snapshot` returned immediately with a ready future, or if it threw, the code in `server_impl::send_snapshot` would not update `_snapshot_transfers` correctly. The code assumed that the continuation attached to `rpc::send_snapshot` (with `then_wrapped`) was executed after `_snapshot_transfers` below the `rpc::send_snapshot` call was updated. That would not necessarily be true (the continuation may not even have been executed at all if `rpc::send_snapshot` threw). Fix that by wrapping the `rpc::send_snapshot` call into a continuation attached to `later()`. Originally authored by Gleb <gleb@scylladb.com>, I added a comment.	2021-10-05 11:04:11 +02:00
Gleb Natapov	78774a485a	raft: drop local snapshot if it cannot be installed If a locally taken snapshot cannot be installed because a newer one was received in the meantime, it should be dropped, otherwise it will take space needlessly. Message-Id: <YUrWXxVfBjEio1Ol@scylladb.com>	2021-09-27 13:03:23 +02:00
Avi Kivity	daf028210b	build: enable -Winconsistent-missing-override warning This warning can catch a virtual function that thinks it overrides another, but doesn't, because the two functions have different signatures. This isn't very likely since most of our virtual functions override pure virtuals, but it's still worth having. Enable the warning and fix numerous violations. Closes #9347	2021-09-15 12:55:54 +03:00
Gleb Natapov	ce40b01b07	raft: rename snapshot into snapshot_descriptor The snapshot structure does not contain the snapshot itself but only refers to it through its id. Rename it to snapshot_descriptor for clarity.	2021-08-29 12:53:03 +03:00
Gleb Natapov	0aa2e95475	raft: drop snapshot if its application failed No need to keep a snapshot that was not applied.	2021-08-29 12:53:03 +03:00
Gleb Natapov	f9f859ac40	raft: fix local snapshot detection The code assumes that the snapshot that was taken locally is never applied. Currently the logic to detect that is flawed. It relies on the id of the most recently applied snapshot (where a locally taken snapshot is considered to be applied as well). But if another local snapshot is taken between snapshot creation and the check, the ids will not match. The patch fixes this by propagating locality information together with the snapshot itself.	2021-08-29 12:53:03 +03:00
Gleb Natapov	03a266d73b	raft: make read_barrier work on a follower as well as on a leader This patch implements a Raft extension that allows performing linearisable reads by accessing the local state machine. The extension is described in section 6.4 of the PhD. To sum it up: to perform a read barrier, a follower needs to ask the leader for the last committed index that it knows about. The leader must make sure that it is still the leader before answering by communicating with a quorum. When the follower gets the index back it waits for it to be applied, and by that completes the read_barrier invocation. The patch adds three new RPCs: read_barrier, read_barrier_reply and execute_read_barrier_on_leader. The last one is the one a follower uses to ask the leader about the safe index it can read. The first two are used by the leader to communicate with a quorum.	2021-08-25 08:57:13 +03:00
Gleb Natapov	73af7edc78	raft: add a function to wait for an index to be applied	2021-08-25 08:19:25 +03:00
Konstantin Osipov	0429196e06	raft: (server) add a helper to wait through uncertainty period Add a helper to be able to wait until a Raft cluster leader is elected. It can be used to avoid sleeps when it's necessary to forward a request to the leader, but the leader is yet unknown.	2021-08-25 08:19:25 +03:00
Gleb Natapov	bd0fd579cf	raft: fix indentation in applier_fiber	2021-08-25 08:19:25 +03:00
Kamil Braun	1ca4d30cc3	raft: sanity checking of apply index Check that entries are applied in the correct order.	2021-08-06 12:21:19 +02:00
Kamil Braun	c6563220b0	raft: store cluster configuration when taking snapshots We add a function `log_last_conf_before(index_t)` to `fsm` which, given an index greater than the last snapshot index, returns the configuration at this index, i.e. the configuration of the last configuration entry before this index. This function is then used in `applier_fiber` to obtain the correct configuration to be stored in a snapshot. In order to ensure that the configuration can be obtained, i.e. the index we're looking at is not smaller than the last snapshot index, we strengthen the conditions required for taking a snapshot: we check that `_fsm` has not yet applied a snapshot at a larger index (which it may have due to a remote snapshot install request). This also causes fewer unnecessary snapshots to be taken in general.	2021-08-06 12:00:32 +02:00
Kamil Braun	f050d3682c	raft: fsm: stronger check for outdated remote snapshots We must not apply remote snapshots with commit indexes smaller than our local commit index; this could result in out-of-order command application to the local state machine replica, leading to serializability violations. Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>	2021-08-05 14:29:50 +02:00
Kamil Braun	e9632ee986	raft: use the correct term when storing a snapshot We should not use the current term; we should use the term of the snapshot's index, which may be lower.	2021-08-02 11:46:04 +02:00
Gleb Natapov	f0047bd749	raft: apply snapshots in applier_fiber We want to serialize snapshot application with command application, otherwise a command may be applied after a snapshot that already contains the result of its application (it is not necessarily a problem since raft by itself does not guarantee apply-once semantics, but it is better to prevent it when possible). This also moves all interactions with the user's state machine into one place. Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>	2021-07-23 18:05:38 +02:00
Gleb Natapov	aa8c6b85fb	raft: do not apply empty command list Do not call user's state machine apply() if there is nothing to apply. Message-Id: <YO1dMitXnZhZlmra@scylladb.com>	2021-07-19 18:26:18 +02:00
Gleb Natapov	ed49d29473	raft: allow to initiate leader stepdown process Sometimes an ability to force a leader change is needed, for instance if a node that is currently serving as a leader needs to be brought down for maintenance. If it is shut down without a leadership transfer, the cluster will be unavailable for at least the leader election timeout. We already have a mechanism to transfer the leadership in case an active leader is removed. The patch exposes it as an external interface with a timeout period. If a node is still a leader after the timeout the operation will fail.	2021-06-22 14:36:42 +03:00
Konstantin Osipov	c67c77ed03	raft: (server) wait for configuration transition to complete By default, wait for the server to leave the joint configuration when making a configuration change. When assembling a fresh cluster Scylla may run a series of configuration changes. These changes would all go through the same leader and serialize in the critical section around server::cas(). Unless this critical section protects the complete transition from C_old configuration to C_new, after the first configuration is committed, the second may fail with exception that a configuration change is in progress. The topology changes layer should handle this exception, however, this may introduce either unpleasant delays into cluster assembly (i.e. if we sleep before retry), or a busy-wait/thundering herd situation, when all nodes are retrying their configuration changes. So let's be nice and wait for a full transition in server::set_configuration().	2021-06-16 16:52:43 +03:00
Konstantin Osipov	631c89e1a6	raft: (server) implement raft::server::get_configuration() raft::server::set_configuration() is useless on application level if we can't query the previous configuration.	2021-06-16 16:52:43 +03:00
Gleb Natapov	580edcef27	raft: register metrics only after fsm is created Metrics access _fsm pointer, so we should register them only after the pointer is populated. Fixes: #8824 Message-Id: <YMilsCslLAeEnbaw@scylladb.com>	2021-06-16 09:34:49 +02:00
Konstantin Osipov	684e0d2a8c	raft: improve configuration consistency checks Isolate the checks for configuration transitions in a static function, to be able to unit test outside class server. Split the condition of transitioning to an empty configuration from the condition of transitioning into a configuration with no voters, to produce more user-friendly error messages. Allow transferring leadership in a configuration where the only voter is the leader itself. This is equivalent to syncing the leader log with the learner and converting the leader itself to a follower. This is safe, since the leader will re-elect itself quickly after an election timeout, and may be used to do a rolling restart of a cluster with only one voter. A test case follows.	2021-06-11 17:16:47 +03:00

101 Commits