Commit Graph

31 Commits

Author SHA1 Message Date
Kamil Braun
a83e04279e raft: fsm: remove constructor used only in tests
This constructor does not provide persisted commit index. It was only
used in tests, so move it there, to the helper `fsm_debug` which
inherits from `fsm`.

Test cases which used `fsm` directly instead of `fsm_debug` were
modified to use `fsm_debug` so they can access the constructor.
`fsm_debug` doesn't change the behavior of `fsm`, only adds some helper
members. This will be useful in following commits too.
2024-01-18 18:07:17 +01:00
Kefu Chai
fa3129fa29 treewide: use unsigned variable to compare with unsigned
some times we initialize a loop variable like

auto i = 0;

or

int i = 0;

but since the type of `0` is `int`, what we get is a variable of
`int` type, but later we compare it with an unsigned number, if we
compile the source code with `-Werror=sign-compare` option, the
compiler would warn at seeing this. in general, this is a false
alarm, as we are not likely to have a wrong comparison result
here. but in order to prevent issues due to the integer promotion
for comparison in other places. and to prepare for enabling
`-Werror=sign-compare`. let's use unsigned to silence this warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Kamil Braun
daf9c53bb8 raft: split can_vote field from server_address to separate struct
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
  transparent hash and comparator functions, which allow invoking
  `contains` and similar functions by providing a different key type (in
  this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
  helper functions in the test helpers module and use them.
2022-07-18 18:22:10 +02:00
Gleb Natapov
108e7fcc4e raft: enter candidate state immediately when starting a singleton cluster
When a node starts it does not immediately becomes a candidate since it
waits to learn about already existing leader and randomize the time it
becomes a candidate to prevent dueling candidates if several nodes are
started simultaneously.

If a cluster consist of only one node there is no point in waiting
before becoming a candidate though because two cases above cannot
happen. This patch checks that the node belongs to a singleton cluster
where the node itself is the only voting member and becomes candidate
immediately. This reduces the starting time of a single node cluster
which are often used in testing.

Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
2022-03-04 20:30:52 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
11cc772388 test: raft: avoid ignored variable errors
Avoid instantiating unused variables, and in one case ignore it,
to avoid a gcc warning.
2021-10-10 18:17:53 +03:00
Avi Kivity
9907303bf5 test: adjust signed/unsigned comparisons in loops and boost tests
gcc complains about comparing a signed loop induction variable
with an unsigned limit, or comparing an expected value and measured
value. Fix by using unsigned throughout, except in one case where
the signed value was needed for the data_value constructor.
2021-10-10 18:16:50 +03:00
Kamil Braun
bf823e34a4 raft: disable sticky leadership rule
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".

More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.

Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but nevermind.

In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.

In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.

This pretty much invalidates the original argument, as seen on
the above example: A will still consider D alive, thus it won't become
a candidate.

Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believeing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.

Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.

In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
2021-09-26 11:09:01 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it trough its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for entry to become stable before replicate it
Since io_fiber persist entries before sending out messages even non
stable entries will become stable before observed by other nodes.

This patch also moves generation of append messages into get_outptut()
call because without the change we will lose batching since each
advance of last_idx will generate new append message.
2021-08-29 12:48:15 +03:00
Alejo Sanchez
a6cd35c512 raft: testing: refactor helper
Move definitions to helper object file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Gleb Natapov
4764028cb3 raft: Remove leader_id from append_request
The filed is not used anywhere.

Message-Id: <YP0khmjK2JSp77AG@scylladb.com>
2021-07-28 20:30:07 +02:00
Gleb Natapov
09528b8671 raft: test: test leadership transfer timeout
Test that if leadership transfer cannot be done in configured time frame
fsm cancels the leadership transfer process. Also check that timeout_now
message is resent on each tick while leadership transfer is in progress.
2021-06-22 14:42:50 +03:00
Pavel Solodovnikov
e9258f43cd raft: etcd_test: test_transfer_non_member
Test that a node outside configuration, that receives `timeout_now`
message, doesn't disrupt operation of the rest of the cluster.

That is, `timeout_now` has no effect and the outsider stays in
the follower state without promoting to the candidate.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
2b6d73de98 raft: etcd_test: test_leader_transfer_ignore_proposal
Test that a leader which has entered leader stepdown mode rejects
new append requests.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Konstantin Osipov
d42d5aee8c raft: (internal) simplify construction of tagged_id
Make it easy to construct tagged_id from UUID.
2021-06-08 14:52:32 +03:00
Pavel Solodovnikov
0389001496 raft: avoid 'using' statements in raft testing helpers header
It is generally considered a bad practice to use the `using`
directives at global scope in header files.

Also, many parts of `test/raft/helpers.hh` were already
using `raft::` prefixes explicitly, so definitely not much
to lose there.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-17 13:36:09 +03:00
Gleb Natapov
745f63991f raft: test: fix c&p error in a test
Message-Id: <YJKBOwBX8hqHLxsB@scylladb.com>
2021-05-05 17:18:49 +02:00
Alejo Sanchez
ace0ee514f raft: etcd unit tests: test proposal handling scenarios
TestProposal
For multiple scenarios, check proposal handling.

Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
77163ea76a raft: etcd unit tests: test old messages ignored
TestOldMessages
Checks an append request from a leader from a previous term is ignored.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
bf65b19803 raft: etcd unit tests: test single node precandidate
TestSingleNodePreCandidate
Checks a single node configuration with precandidate on works to
automatically elect the node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
de7051467b raft: etcd unit tests: test dueling precandidates
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
aa7d23f86b raft: etcd unit tests: test dueling candidates
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.

But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
1eac94e7d6 raft: etcd unit tests: test cannot commit without new term
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.

NOTE: this doesn't check committed but it's implicit for next round;
      this could also use communicate() providing committed output map

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
b421fe3605 raft: etcd unit tests: test single node commit
Port etcd TestSingleNodeCommit

In a single node configuration elect the node, add 2 entries and check
number of committed entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
9b4538476b raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
368eec1190 raft: etcd unit tests: fix test_progress_leader
Make implementation follow closer to original test.
Use newer boost test helpers.

NOTE: in etcd it seems a leader's self progress is in PIPELINE state.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Pavel Solodovnikov
93c565a1bf raft: allow raft server to start with initial term 0
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.

This restriction is completely artificial and can be lifted
without any problems, which will be described below.

The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).

This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.

Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.

In either case term will never be 0, because:

1. If the term has changed, then it's naturally greater than 0,
   since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
   a vote request message. In such case we have already updated
   our term to the requester's term.

Switch to using an explicit optional in `fsm_output` so that
a reader don't have to think about the motivation behind this `if`
and just checks that `term_and_vote` optional is engaged.

Given the motivation described above, the corresponding

    assert(_fsm->get_current_term() != term_t(0));

in `server_impl::start` is removed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Alejo Sanchez
88063b6e3e raft: tests: move common helpers to header
Move common test helper functions and data structures to a common
helpers.hh header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00