Commit Graph

149 Commits

Author SHA1 Message Date
Konstantin Osipov
eaf32f2c3c raft: (testing) test receiving a confchange in a snapshot 2021-06-11 17:16:56 +03:00
Konstantin Osipov
ba046ed1ab raft: style fix 2021-06-11 17:16:54 +03:00
Konstantin Osipov
b0a1ebc635 raft: step down as a leader if converted to a non-voter
If the leader becomes a non-voter after a configuration change,
step down and become a follower.

Non-voting members are an extension to Raft, so the protocol spec does
not define whether they can be leaders. I can not think of a reason
why they can't, yet I also can not think of a reason why it's useful,
so let's forbid this.

We already do not allow non-voters to become candidates, and
they ignore timeout_now RPC (leadership transfer), so they
already can not be elected.
2021-06-11 17:16:50 +03:00
Konstantin Osipov
684e0d2a8c raft: improve configuration consistency checks
Isolate the checks for configuration transitions in a static function,
to be able to unit test outside class server.

Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.

*Allow* to transfer leadership in a configuration when
the only voter is the leader itself. This would be equivalent
to syncing the leader log with the learner and converting
the leader to the follower itself. This is safe, since
the leader will re-elect itself quickly after an election timeout,
and may be used to do a rolling restart of a cluster with
only one voter.

A test case follows.
2021-06-11 17:16:47 +03:00
Alejo Sanchez
add12d801d raft: log ignored prevote
Add a log line for ignored prevote.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-2-alejo.sanchez@scylladb.com>
2021-06-10 12:33:34 +02:00
Tomasz Grabiec
ce7a404f17 Merge "Cleanups/refactoring for Raft Group 0" from Kostja
* scylla-dev/raft-group-0-part-1-rebase:
  raft: (service) pass Raft service into storage_service
  raft: (service) add comments for boot steps
  raft: add ordering for raft::server_address based on id
  raft: (internal) simplify construction of tagged_id
  raft: (internal) tagged_id minor improvements
2021-06-09 10:48:05 +02:00
Konstantin Osipov
b81580f3c6 raft: add ordering for raft::server_address based on id 2021-06-08 14:52:32 +03:00
Konstantin Osipov
d42d5aee8c raft: (internal) simplify construction of tagged_id
Make it easy to construct tagged_id from UUID.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
c9a23e9b8a raft: (internal) tagged_id minor improvements
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.

Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
2021-06-08 14:52:32 +03:00
Gleb Natapov
5d15ecb7e5 raft: do not block io_fiber just because of a slow follower
Currently if append_message cannot be sent to one of the followers the
entire io_fiber will block which eventually stop the replication. The
patch changes message sending part of io_fiber to be non blocking. The
code adds a hash table that is used to keep track of append_request
sending status per destination. All the remaining futures are waited for
during abort.
Message-Id: <20210606140305.2930189-2-gleb@scylladb.com>
2021-06-07 16:55:14 +02:00
Alejo Sanchez
bd168d57ff raft: fix vote reply handling in prevote
Do not register a reply to prevote as a real vote

Found and authored by @kostja.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210604122530.1975388-1-alejo.sanchez@scylladb.com>
2021-06-06 19:18:49 +03:00
Tomasz Grabiec
50d64646cd Merge "raft: replication test fixes and OOP refactor" from Alejo
Feature requests, fixes, and OOP refactor of replication_test.

Note: all known bugs and hangs are now fixed.

A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided

To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.

* alejo/raft-tests-replication-02-v3-30: (66 commits)
  raft: replication test: wait for log for both index and term
  raft: replication test: reset network at construction
  raft: replication test: use lambda visitor for updates
  raft: replication test: move structs into class
  raft: replication test: move data structures to cluster class
  raft: replication test: remove shared pointers
  raft: replication test: move get_states() to raft_cluster
  raft: replication test: test_server inside raft_cluster
  raft: replication test: rpc declarative tests
  raft: replication test: add wait_log
  raft: replication test: add stop and reset server
  raft: replication test: disconnect 2 support
  raft: replication test: explicit node_id naming
  raft: replication test: move definitions up
  raft: replication test: no append entries support
  raft: replication test: fix helper parameter
  raft: replication test: stop servers out of config
  raft: replication test: wait log when removing leader from configuration
  raft: replication test: only manipulate servers in configuration
  raft: replication test: only cancel rearm ticker for removed server
  ...
2021-06-06 19:18:49 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Gleb Natapov
bb822c92ab raft: change raft::rpc api to return void for most sending functions
Most RAFT packets are sent very rarely during special phases of the
protocol (like election or leader stepdown). The protocol itself does
not care if a packet is sent or dropped, so returning futures from their
send function does not serve any purpose. Change the raft's rpc interface
to return void for all packet types but append_request. We still want to
get a future from sending append_request for backpressure purposes since
replication protocol is more efficient if there is no packet loss, so
it is better to pause a sender than dropping packets inside the rpc. Rpc
is still allowed to drop append_requests if overloaded.
2021-06-06 19:18:49 +03:00
Gleb Natapov
f5a54d6c05 raft: move ELECTION_TIMEOUT definition to a public header
Move ELECTION_TIMEOUT definition to be visible to outside modules.
2021-06-06 19:18:49 +03:00
Gleb Natapov
87844c0ce1 raft: remove unused clock type definition
RAFT uses logical clock now and this define is from older times.
2021-06-06 19:18:49 +03:00
Gleb Natapov
90ea71da54 raft: wait for io and applier fiber to stop before before aborting snapshots and waiters
IO and applier fibers may update waiters and start new snapshot
transfers, so abort() needs to wait for them to stop before proceeding
to abort waiters and snapshot transfers,
2021-06-06 19:18:49 +03:00
Alejo Sanchez
3e91a8ca0d raft: replication test: wait for log for both index and term
Waiting on index alone does not guarantee leader correct leader log
propagation. This patch add checking also the term of the leader's last
log entry.

This was exposed with occasional problems with packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-04 08:38:19 -04:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value() and test optional directly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Gleb Natapov
78c5a72b32 raft: drop _leader_progress tracking from the tracker
The tracker maintains a separate pointer to current leader progress,
but all this complexity is not needed because the tracker already have
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies code and make going out of sync
(which is always a possibility if a state is maintained in two different
places) impossible.
2021-05-09 13:55:55 +03:00
Gleb Natapov
1245736776 raft: move current_leader into the follower state
Only when fsm is in the follower state current_leader has any meaning.
In the leader state a node is always its own follower and in a candidate
state there is no leader. To make sure that the current_leader value
cannot be out of sync with fsm state move it into the follower state.
2021-05-09 13:55:55 +03:00
Gleb Natapov
0634674aef raft: add some precondition checks
Check that fsm does not process messages from itself and that it does
not tries to become its own follower.
2021-05-07 08:04:16 +03:00
Gleb Natapov
aa7ea333da raft: document that add entry my throw commit_status_unknown 2021-05-06 11:59:36 +03:00
Gleb Natapov
d2f58d8656 raft: drop waiters with outdated terms
Currently an entry is declared to be dropped only when an entry with
different term is committed with the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.

Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry which will be committed eventually and will release
one of the waiters on A, but if anything else is submitted to B 9 other
waiters will wait forever.

The way to solve that is to drop all waiters that wait for a term
smaller that one been committed. There is no chance they will be
committed any longer since terms in the log may only grow.
2021-05-06 11:34:31 +03:00
Gleb Natapov
6abe2772dc raft: make snapshot transfer abortable
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfer abortable and
abort them when leader changes.
2021-05-06 11:34:31 +03:00
Gleb Natapov
50d545a138 raft: accept snapshots transfer from multiple nodes simultaneously
A leader may change while one of its followers is in snapshot transfer
mode and that node may get additional request for snapshot transfer from
a new leader while previous transfer is still not aborted. Currently
such situation will trigger an assert. This patch allows to have active
snapshot transfers from multiple nodes, but only one of them will succeed
in the end, all other will be replied to with 'fail'.
2021-05-06 11:34:31 +03:00
Gleb Natapov
073a9be4c7 raft: do not send probes while transferring snapshot
If a follower is in snapshot transfer mode there is no need to send
probe append messages to it.
2021-05-06 11:34:31 +03:00
Gleb Natapov
08077a21b7 raft: handle messages sending errors
Fail to send a message should not abort raft server.
2021-05-06 11:34:31 +03:00
Gleb Natapov
c4d87d7a23 raft: fix a typo in a variable name 2021-05-06 11:33:47 +03:00
Alejo Sanchez
0a5c605713 raft: replication test: fix custom election
Use the new specific connectivity to manage old leader disconnection
more specifically.

This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader.  For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for A.

So to make the general case old leader needs to be first disconnected
from all nodes, make the desired node candidate, then have the old
leader connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).

Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this this patch retries
the process until the desired node becomes leader.

The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the later waits until a candidate either
becomes a leader or reverts to follower

The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Pavel Solodovnikov
4c351ff260 raft: switch group_id type from uint64_t to utils::UUID
Introduce a tagged id struct for `group_id`.

Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).

Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.

The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
2021-05-02 16:39:54 +03:00
Kamil Braun
4c95277619 raft: fsm: fix assertion failure on stray rejects
When probes are sent over a slow network, the leader would send
multiple probes to a lagging follower before it would get a
reject response to the first probe back. After getting a reject, the
leader will be able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out of order reject
to a now irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.

We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.

The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives.  We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
2021-04-27 01:07:22 +02:00
Pavel Solodovnikov
fba1910770 raft: fix incorrect rpc setup in server_impl::start()
RPC configuration was updated only when an instance was
started with an initial snapshot.

In case we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.

Fix that by moving RPC setup code outside the check for
snapshot id and look at `_log.get_configuration()` instead.

Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.

For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
2021-04-26 20:46:50 +02:00
Kamil Braun
8e9a9f8bd3 raft: fsm: include config entries in output.committed
Otherwise waiters on committed configuration changes (e.g.
`server::set_configuration`) would never get notified.

Also if we tried to send another entry concurrently we would get
replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed.
(not sure if this commit also fixes whatever caused that).
Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>
2021-04-22 15:38:10 +02:00
Avi Kivity
14a4173f50 treewide: make headers self-sufficient
In preparation for some large header changes, fix up any headers
that aren't self-sufficient by adding needed includes or forward
declarations.
2021-04-20 21:23:00 +03:00
Gleb Natapov
9fdb3d3d98 raft: stop using seastar::pipe to pass log entries to apply_fiber
Stop use seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above queue anyway and it
adds functionality that we do not need (EOS) and hinds functionality that
we do (been able to abort()). This fixes a crash during abort where the
pipe was uses after been destroyed.

Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
2021-04-12 13:18:03 +02:00
Gleb Natapov
fb938a36d4 raft: disallow adding and creating servers with id zero
Id zero has special meaning in the code and cannot be valid server id.
Message-Id: <20210407134853.1964226-1-gleb@scylladb.com>
2021-04-08 17:07:18 +02:00
Gleb Natapov
b3cb4f3966 raft: fix quorum check code for joint config and non-voting members
Current leader code check for most nodes to be alive, but this is
incorrect since some nodes may be non-voting and hence should not
cause a leader to stepdown if dead. It also incorrect with joint config
since quorum is calculated differently there. Fix it by introducing
activity_tracker class that knows how to handle all the above details.
2021-04-07 10:15:33 +03:00
Gleb Natapov
a48a2c454b raft: do not hang on waiting for entries on a leader that was removed from a cluster
If a leader is removed from a cluster it will never know when entries
that it did not committed yet will be committed, so abort the wait in
this case with uncertainty error.
2021-04-07 10:15:33 +03:00
Gleb Natapov
db03c94692 raft: add more tracing to stepdown code 2021-04-07 10:15:33 +03:00
Gleb Natapov
7dec56721c raft: use existing election_elapsed() function instead of redo the calculation 2021-04-07 10:15:33 +03:00
Gleb Natapov
3bcd3212e2 raft: check that a node is still the leader after initiating stepdown process
Usually initiation of stepdown process does not immediately depose the
current leader, but if the current leader is no longer part of the
cluster it will happen. We were missing the check after initiating
stepdown process in append reply handling.
2021-04-07 10:15:33 +03:00
Gleb Natapov
28add88a1f raft: do not assert when receiving unexpected messages in a leader state
Current code assert when it gets InstallSnapshot/AppendRequest in a leader
state and the term in the message is equal current term. It is true that
such messages cannot be received if the protocol works correctly, but we
should not crash on a network input nonetheless.
2021-04-04 11:33:35 +03:00
Gleb Natapov
995cd1c8a7 raft: use existing function to check if election timeout elapsed
is_past_election_timeout() repeats the calculation that
election_elapsed() is doing. Use existing function instead.
2021-04-04 11:33:35 +03:00
Gleb Natapov
13a3cf62bb raft: move incoming message processing into per state functions
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.

Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
2021-03-29 15:48:43 +02:00
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00