For a follower to forward requests to a leader, the leader must be known.
But there may be a situation where a follower does not learn about
the leader for a while. This can happen when a node becomes a follower while
its log is up-to-date and no new entries are submitted to raft. In such a
case the leader sends nothing to the follower, and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to raft's log, a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting a specially crafted append reject to all
nodes in the cluster on a tick when the leader is not known. The leader
responds to this message with an empty append request, which causes the
node to learn about the leader. As an optimisation, the patch
sends the broadcast only when there is actually an operation
waiting for the leader to become known.
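Schematically, the tick path looks like this (an illustrative sketch;
all names below are invented and do not match the actual code):

    #include <cstdint>
    #include <vector>

    using server_id = uint64_t;

    // Illustrative fsm fragment.
    struct fsm {
        server_id my_id;
        bool leader_known = false;
        std::vector<server_id> members;

        bool has_waiters() const;            // an operation waits for a leader
        void send_append_reject(server_id);  // the specially crafted reject

        // On a tick with no known leader and a waiting operation,
        // broadcast the reject; the leader answers with an empty
        // append request, revealing itself to this follower.
        void tick() {
            if (!leader_known && has_waiters()) {
                for (auto id : members) {
                    if (id != my_id) {
                        send_append_reject(id);
                    }
                }
            }
        }
    };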
Fixes #10379
When a node starts it does not immediately become a candidate: it
waits to learn about an already existing leader, and randomizes the
time at which it becomes a candidate to prevent dueling candidates if
several nodes are started simultaneously.
If a cluster consists of only one node, there is no point in waiting
before becoming a candidate, because the two cases above cannot
happen. This patch checks whether the node belongs to a singleton cluster,
where the node itself is the only voting member, and if so becomes a candidate
immediately. This reduces the startup time of single-node clusters,
which are often used in testing.
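Roughly, the new check amounts to the following (a sketch with
invented names, not the actual code):

    #include <cstdint>
    #include <set>

    using server_id = uint64_t;

    struct configuration {
        std::set<server_id> voters;  // voting members only
    };

    // The node may skip the pre-election wait iff it is the sole
    // voter: there is no existing leader to learn about and no other
    // candidate to duel with.
    bool is_singleton(const configuration& cfg, server_id me) {
        return cfg.voters.size() == 1 && cfg.voters.count(me) == 1;
    }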
Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
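For example, a dual-licensed source file now carries a single header
line instead of the blurb (expression as chosen in this commit):

    // SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)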
Closes #9937
Operations adding or removing a node from the Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.
In the future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.
Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.
Until this is done, do our best to avoid a cluster state where
a removed node, or a node whose addition failed, is stuck in the Raft
configuration while the node is no longer present in gossip-managed
system tables. In other words, keep gossip the primary source of
truth. For this purpose, carefully choose the timing of when we
join and leave Raft group 0:
Join Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node and it's visible in nodetool status,
but before the node's state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends can be re-tried.
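Schematically, the resulting ordering is (a sketch under invented
helper names, not the actual boot/removenode code):

    #include <cstdint>

    using node_id = uint64_t;

    // Invented helpers standing in for the real paths.
    void advertise_tokens_through_gossip();
    void join_group0();  // idempotent, so safe to call on every restart
    void become_normal();
    void remove_from_group0(node_id);
    void remove_tokens_from_gossip(node_id);

    // Bootstrap: join group 0 after tokens are advertised (the node
    // is already visible in nodetool status) but before it goes
    // "normal" and starts accepting queries.
    void bootstrap() {
        advertise_tokens_through_gossip();
        join_group0();
        become_normal();
    }

    // Removal: leave group 0 *before* removing tokens from gossip, so
    // a failed group 0 removal leaves the node in the ring and
    // nodetool removenode can simply be re-tried.
    void remove_node(node_id id) {
        remove_from_group0(id);
        remove_tokens_from_gossip(id);
    }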
Add tracing.
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".
More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at a time):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.
Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but never mind.
In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
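The rule boils down to the following (illustrative sketch; the names
are invented):

    #include <chrono>
    #include <cstdint>

    using steady = std::chrono::steady_clock;

    struct follower {
        steady::time_point last_leader_contact;  // last heartbeat seen
        steady::duration election_timeout;

        bool log_up_to_date(uint64_t candidate_last_term,
                            uint64_t candidate_last_idx) const;

        // Sticky leadership: refuse the vote, even to a candidate
        // with a higher term, while the current leader looks alive.
        bool grant_vote(uint64_t candidate_last_term,
                        uint64_t candidate_last_idx) const {
            if (steady::now() - last_leader_contact < election_timeout) {
                return false;
            }
            return log_up_to_date(candidate_last_term, candidate_last_idx);
        }
    };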
In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.
This pretty much invalidates the original argument, as seen in
the above example: A will still consider D alive, thus it won't become
a candidate.
A shared failure detector combined with sticky leadership actually makes
the situation worse: it may cause cluster unavailability in certain
scenarios (fortunately not permanently; it can be resolved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.
Note that this scenario requires allowing configuration changes which
remove and then re-add the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking the safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.
In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply with a shared failure
detector, and because one may even dispute their argument in vanilla
Raft given that prevoting is enabled (see the end of the third paragraph
of this commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
Since the io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other nodes.
This patch also moves generation of append messages into the get_output()
call, because without the change we would lose batching, since each
advance of last_idx would generate a new append message.
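A sketch of the batching idea (the structure is hypothetical): append
messages are generated when the output is drained, not on every
advance of last_idx, so entries accumulated since the previous drain
go out in one batch:

    #include <vector>

    struct log_entry {};
    struct append_request {};

    struct fsm_output {
        std::vector<log_entry> to_persist;     // stabilized first...
        std::vector<append_request> messages;  // ...then sent
    };

    struct fsm {
        std::vector<log_entry> take_unstable();
        std::vector<append_request> make_append_requests();

        fsm_output get_output() {
            fsm_output out;
            out.to_persist = take_unstable();
            // Building append messages here, rather than on every
            // advance of last_idx, lets a single message carry all
            // entries accumulated since the previous call.
            out.messages = make_append_requests();
            return out;
        }
    };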
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
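In other words (sketch; names invented):

    #include <cstdint>

    using index_t = uint64_t;

    // Apply a remote snapshot only if it advances the local commit
    // index; otherwise applying it could reorder or re-apply commands
    // on the local state machine replica.
    bool can_apply_snapshot(index_t snapshot_idx, index_t local_commit_idx) {
        return snapshot_idx > local_commit_idx;
    }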
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C).
Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. At that point the target hasn't received it yet.
Deliver the `timeout_now` message to the target but lose all the
`vote_request(force)` messages it attempts to send.
This should halt the election process.
Then wait for the election timeout so that the candidate node starts another
normal election (without the `force` flag for vote requests).
Check that this candidate then makes progress and is elected a
leader.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C).
Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. At that point the target hasn't received it yet.
Lose this message and verify that the rest of the cluster (B, C)
can make progress and elect a new leader.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
4-node cluster (A, B, C, D). A is initially elected a leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C, D).
Let the cluster communicate up to the point where A starts to resign
its leadership (calls `transfer_leadership()`).
At this point, A should send a `timeout_now` message to one of
the remaining nodes (B, C or D) and the new configuration should be
committed. But no node has actually received the `timeout_now` message
yet.
Determine on which node the message should arrive, accept the
`timeout_now` message and disconnect the target from the rest of the
group.
Check that after that the cluster, which has only two live members,
can make progress and elect a new leader through a normal election process.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When a non-voter is asked for a vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting their current configuration,
and the non-voter may not have the latest configuration
when it is asked to vote.
Once a learner receives TimeoutNow, it becomes a candidate, discovers it
can't vote, doesn't increase its term, and converts back to a
follower. Once entries arrive from a new leader, it updates its
term.
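The resulting behavior, roughly (a sketch with invented names, not
the actual code):

    #include <cstdint>

    // Invented fsm interface, for illustration only.
    struct fsm {
        uint64_t my_id() const;
        bool is_voter(uint64_t id) const;
        void become_candidate();  // here: does not bump the term itself
        void become_follower();   // term left unchanged
    };

    // A learner receiving TimeoutNow becomes a candidate, discovers
    // it cannot vote under its current configuration, and converts
    // back to follower without having increased its term.
    void handle_timeout_now(fsm& f) {
        f.become_candidate();
        if (!f.is_voter(f.my_id())) {
            f.become_follower();
        }
    }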
Now the RPC module has some basic testing coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).
The test suite currently consists of the following
test cases:
* Loading server instance with configuration from a snapshot.
* Loading server instance with configuration from a log.
* Configuration changes (remove + add node).
* Leader elections don't lead to RPC configuration changes.
* Voter <-> learner node transitions also don't change RPC
configuration.
* Reverting uncommitted configuration changes updates
RPC configuration accordingly (two cases: revert to
snapshot config or committed state from the log).
A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in the `rpc_test` implementation.
Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.
This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.
It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.
* manmanson/raft-rpc-tests-v9-alt3:
raft: add tests for RPC module
test: add CHECK_EVENTUALLY_EQUAL utility macro
raft: replication_test: reset test rpc network between test runs
raft: replication_test: extract tickers initialization into a separate func
raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
raft: replication_test: introduce `test_server` aggregate struct
raft: replication_test: support voter<->learner configuration changes
raft: remove duplicate `create_command` function from `replication_test`
raft: avoid 'using' statements in raft testing helpers header
It is generally considered bad practice to use `using`
directives at global scope in header files.
Also, many parts of `test/raft/helpers.hh` were already
using `raft::` prefixes explicitly, so definitely not much
to lose there.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* scylla-dev/raft-cleanup-v1:
raft: drop _leader_progress tracking from the tracker
raft: move current_leader into the follower state
raft: add some precondition checks
The tracker maintains a separate pointer to the current leader's progress,
but all this complexity is not needed because the tracker already has a
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies the code and makes going out of sync
(which is always a possibility when state is maintained in two different
places) impossible.
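The simplification boils down to always using the single lookup path
(sketch; the real tracker differs in detail):

    #include <cstdint>
    #include <unordered_map>

    using server_id = uint64_t;
    struct follower_progress { /* match_idx, next_idx, ... */ };

    struct tracker {
        std::unordered_map<server_id, follower_progress> progress;

        // The one lookup path: returns null if the id (for example,
        // the leader's) is not part of the tracked configuration.
        follower_progress* find(server_id id) {
            auto it = progress.find(id);
            return it == progress.end() ? nullptr : &it->second;
        }
    };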
When probes are sent over a slow network, the leader may send
multiple probes to a lagging follower before it gets a reject
response to the first probe back. After getting a reject, the
leader is able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out-of-order reject
to a now-irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.
We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in the `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.
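Roughly (the field names follow the commit message; the surrounding
structure is a sketch):

    #include <cstdint>

    using index_t = uint64_t;

    struct append_reject {
        index_t non_matching_idx;  // index the follower failed to match
        index_t last_idx;          // follower's last log index at reject time
    };

    // match_idx is the highest index known to be replicated on the
    // follower, so a reject referring to entries at or below it must
    // be a stale answer to an old probe and can be dropped.
    bool is_stray_reject(const append_reject& r, index_t match_idx) {
        return r.non_matching_idx <= match_idx || r.last_idx < match_idx;
    }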
The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives. We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
Add a test for the case where the C_new entry is not the last one in a
leader that is being removed from the cluster. In this case the leader will
continue replication even after committing C_new, and will start the stepdown
process later, when at least one follower is fully synchronized.