Commit Graph

54 Commits

Gleb Natapov
7f26a8eef5 raft: actively search for a leader if it is not known for a tick duration
For a follower to forward requests to a leader, the leader must be known.
But there may be a situation where a follower does not learn about
the leader for a while. This can happen when a node becomes a follower while its
log is up-to-date and no new entries are submitted to raft. In such
a case the leader sends nothing to the follower, and the only way to
learn about the current leader is to receive a message from it. Until a new
entry is added to the raft log, a follower that does not know who the
leader is cannot add entries. Kind of a deadlock. Note that
the problem is specific to our implementation, where failure detection is
done by an outside module. In vanilla raft a leader periodically sends messages
to all followers, so it is essentially never idle.

The patch solves this by broadcasting a specially crafted append reject to all
nodes in the cluster on a tick when the leader is not known. The leader
responds to this message with an empty append request, which causes the
node to learn about the leader. As an optimisation, the patch
sends the broadcast only when there is actually an operation
waiting for the leader to become known.
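The tick-time condition can be sketched roughly as below; the names (`follower_state`, `should_broadcast_leader_probe`) are illustrative only, not the actual fsm API:

```cpp
#include <cassert>
#include <optional>

// Illustrative follower state, not the real ScyllaDB raft fsm.
struct follower_state {
    std::optional<int> current_leader;  // leader id, if known
    bool has_waiting_operations;        // e.g. a request waiting to be forwarded
};

// On a tick, broadcast the specially crafted append reject only when the
// leader is unknown and some operation actually waits for it to be known.
bool should_broadcast_leader_probe(const follower_state& s) {
    return !s.current_leader.has_value() && s.has_waiting_operations;
}
```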

Fixes #10379
2022-04-25 14:51:22 +02:00
Gleb Natapov
108e7fcc4e raft: enter candidate state immediately when starting a singleton cluster
When a node starts it does not immediately become a candidate: it
waits to learn about an already existing leader, and it randomizes the time at
which it becomes a candidate, to prevent dueling candidates if several nodes
are started simultaneously.

If a cluster consists of only one node, there is no point in waiting
before becoming a candidate, because neither of the two cases above can
happen. This patch checks whether the node belongs to a singleton cluster,
where the node itself is the only voting member, and if so becomes a candidate
immediately. This reduces the startup time of single-node clusters,
which are often used in testing.
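A rough sketch of the singleton check, with hypothetical names:

```cpp
#include <cassert>
#include <set>

// Sketch only: the node may skip the randomized election wait when it is
// the sole voting member of the configuration.
bool can_become_candidate_immediately(const std::set<int>& voters, int my_id) {
    return voters.size() == 1 && voters.count(my_id) == 1;
}
```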

Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
2022-03-04 20:30:52 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding a node to or removing a node from the Raft
configuration are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In the future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, we do our best to avoid a cluster state where
a removed node, or a node whose addition failed, is stuck in the Raft
configuration but is no longer present in gossip-managed
system tables. In other words, we keep gossip the primary source of
truth. For this purpose, we carefully choose the timing of when we
join and leave Raft group 0:

Join Raft group 0 only after we've advertised our tokens, so that the
cluster is aware of this node and it is visible in nodetool status,
but before the node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends can be retried.

Add tracing.
2021-11-25 12:35:42 +03:00
Gleb Natapov
db25f1dbb8 raft: test: test case for the issue #9552
Test that if a leader tries to append an entry that falls inside
a follower's snapshot, the protocol stays alive.
2021-11-09 14:51:40 +02:00
Gleb Natapov
3a88fa5f70 raft: test: add test for correct last configuration index calculation during snapshot application 2021-11-09 14:51:40 +02:00
Avi Kivity
11cc772388 test: raft: avoid ignored variable errors
Avoid instantiating unused variables, and in one case ignore it,
to avoid a gcc warning.
2021-10-10 18:17:53 +03:00
Kamil Braun
bf823e34a4 raft: disable sticky leadership rule
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".

More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.

Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but never mind.

In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.

In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.

This pretty much invalidates the original argument, as seen in
the above example: A will still consider D alive, thus it won't become
a candidate.

Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2), but it keeps considering B alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.

Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.

In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
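For illustration, the vote-granting predicate with and without the sticky rule might be sketched as follows; this is a hypothetical simplification, not the actual fsm code:

```cpp
#include <cassert>

// Simplified vote request context; field names are hypothetical.
struct vote_request_ctx {
    bool candidate_log_up_to_date;    // candidate's log at least as complete as ours
    bool heard_from_leader_recently;  // leader considered alive (failure detector)
};

// Sticky leadership (now disabled): refuse the vote while the current
// leader still appears alive, even for a higher-term candidate. With the
// shared failure detector this can wedge the cluster as described above.
bool grant_vote_sticky(const vote_request_ctx& c) {
    return c.candidate_log_up_to_date && !c.heard_from_leader_recently;
}

// After this patch: only the log completeness check remains.
bool grant_vote(const vote_request_ctx& c) {
    return c.candidate_log_up_to_date;
}
```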
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
2021-09-26 11:09:01 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for an entry to become stable before replicating it
Since the io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other nodes.

This patch also moves generation of append messages into the get_output()
call, because without this change we would lose batching: each
advance of last_idx would generate a new append message.
2021-08-29 12:48:15 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test 2021-08-25 08:57:13 +03:00
Alejo Sanchez
a6cd35c512 raft: testing: refactor helper
Move definitions to helper object file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Kamil Braun
7533c84e62 raft: sometimes become a candidate even if outside the configuration
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
2021-08-06 13:18:32 +02:00
Kamil Braun
f050d3682c raft: fsm: stronger check for outdated remote snapshots
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
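The strengthened check can be sketched as below (hypothetical names):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a remote snapshot is applied only if it advances our commit
// index; otherwise applying it could replay commands out of order on the
// local state machine replica.
bool accept_remote_snapshot(uint64_t snapshot_idx, uint64_t local_commit_idx) {
    return snapshot_idx > local_commit_idx;
}
```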
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
2021-08-05 14:29:50 +02:00
Gleb Natapov
4764028cb3 raft: Remove leader_id from append_request
The field is not used anywhere.

Message-Id: <YP0khmjK2JSp77AG@scylladb.com>
2021-07-28 20:30:07 +02:00
Pavel Solodovnikov
ab6b0e3d62 raft: fsm_test: test_leader_transfer_lost_force_vote_request
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).

Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes, which at that point has not received it yet.

Deliver the `timeout_now` message to the target but lose all the
`vote_request(force)` messages it attempts to send.
This should halt the election process.
Then wait for the election timeout so that the candidate node starts another
normal election (without the `force` flag for vote requests).

Check that this candidate further makes progress and is elected a
leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
97fe6f9d49 raft: fsm_test: test_leader_transfer_lost_timeout_now
3-node cluster (A, B, C). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C).

Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes, which at that point has not received it yet.

Lose this message and verify that the rest of the cluster (B, C)
can make progress and elect a new leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
c32497b798 raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
4-node cluster (A, B, C, D). A is initially elected a leader.
The leader adds a new configuration entry, that removes it from the
cluster (B, C, D).
Communicate the cluster up to the point where A starts to resign
its leadership (calls `transfer_leadership()`).
At this point, A should send a `timeout_now` message to one of
the remaining nodes (B, C or D) and the new configuration should be
committed. But no node has actually received the `timeout_now` message
yet.

Determine on which node the message should arrive, accept the
`timeout_now` message and disconnect the target from the rest of the
group.

Check that afterwards the cluster, which has only two live members,
can make progress and elect a new leader through a normal election process.

tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:19 +03:00
Konstantin Osipov
2be8a73c34 raft: (testing) test non-voter can vote
When a non-voter is asked for a vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting their current configuration,
and the non-voter may not have the latest configuration
when it is asked to vote.
2021-06-11 17:16:57 +03:00
Konstantin Osipov
eaf32f2c3c raft: (testing) test receiving a confchange in a snapshot 2021-06-11 17:16:56 +03:00
Konstantin Osipov
d08ad76c24 raft: (testing) test voter-non-voter config change loop 2021-06-11 17:16:55 +03:00
Konstantin Osipov
6e4619fe87 raft: (testing) test non-voter doesn't start election on election timeout 2021-06-11 17:16:55 +03:00
Konstantin Osipov
c8ae13a392 raft: (testing) test what happens when a learner gets TimeoutNow
Once a learner receives TimeoutNow it becomes a candidate, discovers it
can't vote, doesn't increase its term and converts back to a
follower. Once entries arrive from a new leader it updates its
term.
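The described learner behaviour can be sketched as follows; all names are illustrative, not the real fsm:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative node state, not the real ScyllaDB raft fsm.
struct node_state {
    bool is_voter;
    uint64_t term;
    bool is_candidate;
};

// A learner receiving TimeoutNow briefly becomes a candidate, discovers it
// cannot vote, and reverts to follower without bumping its term.
void on_timeout_now(node_state& n) {
    n.is_candidate = true;
    if (!n.is_voter) {
        n.is_candidate = false;  // back to follower; term unchanged
    }
}

// Convenience wrapper for exercising the transition on a copy.
node_state after_timeout_now(node_state n) {
    on_timeout_now(n);
    return n;
}
```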
2021-06-11 17:16:55 +03:00
Konstantin Osipov
a972269630 raft: (testing) implement a test for a leader becoming non-voter 2021-06-11 17:16:55 +03:00
Konstantin Osipov
3e6fd5705b raft: (testing) test that non-voter stays in PIPELINE mode
Test that configuration changes preserve PIPELINE mode.
2021-06-11 17:07:39 +03:00
Konstantin Osipov
d42d5aee8c raft: (internal) simplify construction of tagged_id
Make it easy to construct tagged_id from UUID.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
52f7ff4ee4 raft: (testing) update copyright
An incorrect copyright information was copy-pasted
from another test file.

Message-Id: <20210525183919.1395607-1-kostja@scylladb.com>
2021-05-27 15:47:49 +03:00
Tomasz Grabiec
b1821c773f Merge "raft: basic RPC module testing" from Pavel Solodovnikov
The RPC module now has basic testing coverage to
make sure the RPC configuration is updated appropriately
on configuration changes (i.e. `add_server` and
`remove_server` are called when appropriate).

The test suite currently consists of the following
test-cases:
 * Loading server instance with configuration from a snapshot.
 * Loading server instance with configuration from a log.
 * Configuration changes (remove + add node).
 * Leader elections don't lead to RPC configuration changes.
 * Voter <-> learner node transitions also don't change RPC
   configuration.
 * Reverting uncommitted configuration changes updates
   RPC configuration accordingly (two cases: revert to
   snapshot config or committed state from the log).

A few more refactorings are made along the way to be
able to reuse some existing functions from
`replication_test` in `rpc_test` implementation.

Please note, though, that there are still some functions
that are borrowed from `replication_test` but not yet
extracted to common helpers.

This is mostly because the RPC tests don't need all
the complexity that `replication_test` has; thus,
some helpers are copied in a reduced form.

It would take some effort to refactor these bits to
fit both `replication_test` and `rpc_test` without
sacrificing convenience.
This will probably be addressed in another series later.

* manmanson/raft-rpc-tests-v9-alt3:
  raft: add tests for RPC module
  test: add CHECK_EVENTUALLY_EQUAL utility macro
  raft: replication_test: reset test rpc network between test runs
  raft: replication_test: extract tickers initialization into a separate func
  raft: replication_test: support passing custom `apply_fn` to `change_configuration()`
  raft: replication_test: introduce `test_server` aggregate struct
  raft: replication_test: support voter<->learner configuration changes
  raft: remove duplicate `create_command` function from `replication_test`
  raft: avoid 'using' statements in raft testing helpers header
2021-05-24 14:44:37 +02:00
Gleb Natapov
b4d6bdb16e raft: test: check that a leader does not send probes to a follower in the snapshot mode
Message-Id: <YKTNN7vNGkQwTDX7@scylladb.com>
2021-05-23 01:06:12 +02:00
Pavel Solodovnikov
0389001496 raft: avoid 'using' statements in raft testing helpers header
It is generally considered bad practice to use `using`
directives at global scope in header files.

Also, many parts of `test/raft/helpers.hh` were already
using `raft::` prefixes explicitly, so definitely not much
to lose there.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value(); test optionals directly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Gleb Natapov
78c5a72b32 raft: drop _leader_progress tracking from the tracker
The tracker maintains a separate pointer to the current leader's progress,
but all this complexity is not needed because the tracker already has a
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies the code and makes going out of sync
(which is always a possibility when state is maintained in two different
places) impossible.
2021-05-09 13:55:55 +03:00
Kamil Braun
4c95277619 raft: fsm: fix assertion failure on stray rejects
When probes are sent over a slow network, the leader may send
multiple probes to a lagging follower before getting a reject
response to the first probe. After getting a reject, the
leader will be able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out of order reject
to a now irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.

We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.

The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives.  We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
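A minimal sketch of the strengthened check, with hypothetical field names:

```cpp
#include <cassert>
#include <cstdint>

// Simplified reject message fields; names are hypothetical.
struct append_reject {
    uint64_t non_matching_idx;  // index the follower failed to match
    uint64_t last_idx;          // follower's last log index when rejecting
};

// A reject is stray (to be ignored) if acting on it would contradict what
// the leader already knows is matched. The check may miss some stray
// rejects (false negatives), but never misclassifies a relevant reject.
bool is_stray_reject(const append_reject& r, uint64_t match_idx) {
    return r.non_matching_idx <= match_idx || r.last_idx < match_idx;
}
```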
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
2021-04-27 01:07:22 +02:00
Gleb Natapov
b9175edea4 raft: test: check that a server with id zero can be neither created nor added to a config
Message-Id: <20210407134853.1964226-2-gleb@scylladb.com>
2021-04-08 17:07:18 +02:00
Gleb Natapov
68d73bd4c8 raft: add test for check quorum on a leader 2021-04-07 10:15:33 +03:00
Gleb Natapov
bdb59307d3 raft: test: add test case for stepdown process
Add a test for the case where the C_new entry is not the last one in a
leader that is being removed from a cluster. In this case the leader will
continue replication even after committing C_new and will start the stepdown
process later, when at least one follower is fully synchronized.
2021-04-07 10:15:33 +03:00
Gleb Natapov
10781037f5 raft: test: add test that leader behaves as expected when it gets unexpected messages 2021-04-04 11:33:35 +03:00
Konstantin Osipov
1a1d7ab662 raft: (testing) stray replies from removed followers 2021-03-24 14:05:55 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return the snapshot index as the last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
cec59e53ef raft: (testing) leader change during configuration change 2021-03-24 14:05:36 +03:00
Konstantin Osipov
a203c8833f raft: (testing) test confchange {ABCDE} -> {ABCDEFG} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
40e117d36e raft: (testing) test confchange {ABCDEF} -> {ABCGH} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
14b2d5d308 raft: (testing) test confchange {ABC} -> {CDE}
Test leader change during configuration change.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
3c718a175e raft: (testing) test confchange {AB} -> {CD} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
2e30c8540e raft: (testing) test confchange {A} -> {B}
Test non-restart and leader restart scenario.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
e23da06fef raft: (testing) test a server with empty configuration
Try becoming a candidate for such a server, or adding it
to an existing configuration.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
8d26d24370 raft: (testing) simplify id allocation in test 2021-03-24 14:04:18 +03:00
Konstantin Osipov
7182323ac0 raft: (testing) style cleanup in raft_fsm_test
1) Avoid memory violations on test failure
2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs
   BOOST_CHECK)
2021-03-24 14:04:18 +03:00