Currently an entry is declared to be dropped only when an entry with
different term is committed with the same index, but that may create a
situation where, if no new entries are submitted for a long time, an
already dropped entry will not be noticed for a long time as well.
Consider the case where a client submits 10 entries on a leader A, but
before they get replicated the leadership moves to a node B. B will
commit a dummy entry which will be committed eventually and will release
one of the waiters on A, but if anything else is submitted to B 9 other
waiters will wait forever.
The way to solve that is to drop all waiters that wait for a term
smaller that one been committed. There is no chance they will be
committed any longer since terms in the log may only grow.
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfer abortable and
abort them when leader changes.
A leader may change while one of its followers is in snapshot transfer
mode and that node may get additional request for snapshot transfer from
a new leader while previous transfer is still not aborted. Currently
such situation will trigger an assert. This patch allows to have active
snapshot transfers from multiple nodes, but only one of them will succeed
in the end, all other will be replied to with 'fail'.
Use the new specific connectivity to manage old leader disconnection
more specifically.
This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader. For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for A.
So to make the general case old leader needs to be first disconnected
from all nodes, make the desired node candidate, then have the old
leader connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this this patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the later waits until a candidate either
becomes a leader or reverts to follower
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Introduce a tagged id struct for `group_id`.
Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).
Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.
The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
When probes are sent over a slow network, the leader would send
multiple probes to a lagging follower before it would get a
reject response to the first probe back. After getting a reject, the
leader will be able to correctly position `next_idx` for that
follower and switch to pipeline mode. Then, an out of order reject
to a now irrelevant probe could crash the leader, since it would
effectively request it to "rewind" its `match_idx` for that
follower, and the code asserts this never happens.
We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.
The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives. We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
RPC configuration was updated only when an instance was
started with an initial snapshot.
In case we don't have an initial snapshot, but do have
a non-empty log with a configuration entry, the RPC
instance isn't set up correctly.
Fix that by moving RPC setup code outside the check for
snapshot id and look at `_log.get_configuration()` instead.
Also, set up RPC mappings both for `current` and `previous`
components, since in case the last configuration index
points to an entry from the log, it can happen to be
a joint configuration entry.
For example, this can happen if a leader made an attempt
to change configuration, but failed shortly afterwards
without being able to commit the new configuration.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodonikov@scylladb.com>
Message-Id: <20210423220718.642470-1-pa.solodovnikov@scylladb.com>
Otherwise waiters on committed configuration changes (e.g.
`server::set_configuration`) would never get notified.
Also if we tried to send another entry concurrently we would get
replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed.
(not sure if this commit also fixes whatever caused that).
Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>
Stop use seastar::pipe and use seastar::queue directly to pass log
entries to apply_fiber. The pipe is a layer above queue anyway and it
adds functionality that we do not need (EOS) and hinds functionality that
we do (been able to abort()). This fixes a crash during abort where the
pipe was uses after been destroyed.
Message-Id: <YHLkPZ9+sdLhwcjZ@scylladb.com>
Current leader code check for most nodes to be alive, but this is
incorrect since some nodes may be non-voting and hence should not
cause a leader to stepdown if dead. It also incorrect with joint config
since quorum is calculated differently there. Fix it by introducing
activity_tracker class that knows how to handle all the above details.
If a leader is removed from a cluster it will never know when entries
that it did not committed yet will be committed, so abort the wait in
this case with uncertainty error.
Usually initiation of stepdown process does not immediately depose the
current leader, but if the current leader is no longer part of the
cluster it will happen. We were missing the check after initiating
stepdown process in append reply handling.
Current code assert when it gets InstallSnapshot/AppendRequest in a leader
state and the term in the message is equal current term. It is true that
such messages cannot be received if the protocol works correctly, but we
should not crash on a network input nonetheless.
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.
Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.
Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.
This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.
It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
Since we use external failure detector instead of relying on empty
AppendRequests from a leader there can be a situation where a node
is no longer part of a certain raft group but is still alive (and also
may be part of other raft groups). In such case last election time
should not be updated even if the node is alive. It is the same as if
it would have stopped to send empty AppendRequests in original raft.
By the time we receive snapshot_reply from a follower
we may no longer be the leader. Follower term may be
different from snapshot term, e.g. the follower may
be aware of a new leader already and have a higher term.
We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.
The change helps reduce protocol state managed outside FSM.
If the current leader is set, the follower will not vote
for another candidate. This is also known as "sticky leadership" rule.
Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
As the existing comment explains a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.
Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, which will be described below.
The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In such case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output` so that
a reader don't have to think about the motivation behind this `if`
and just checks that `term_and_vote` optional is engaged.
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
We already have server state dependant state in fsm, so there is no need
to maintain "voters" and "tracker" optionals as well. The upside is that
optional and variant sates cannot drift apart now.