For a follower to forward requests to a leader, the leader must be
known. But there may be a situation where a follower does not learn
about the leader for a while. This can happen when a node becomes a
follower while its log is up-to-date and no new entries are submitted
to raft. In such a case the leader sends nothing to the follower, and
the only way to learn about the current leader is to get a message
from it. Until a new entry is added to the raft log, a follower that
does not know who the leader is cannot add entries. Kind of a
deadlock. Note that the problem is specific to our implementation,
where failure detection is done by an outside module. In vanilla raft
a leader sends messages to all followers periodically, so essentially
it is never idle.
The patch solves this by broadcasting a specially crafted append
reject to all nodes in the cluster on a tick whenever the leader is
not known. The leader responds to this message with an empty append
request, which causes the node to learn about the leader. As an
optimisation, the patch sends the broadcast only if there is actually
an operation waiting for the leader to be known.
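In sketch form (illustrative names, not the actual patch):

    // On a tick: a follower that has operations waiting for the
    // leader to be known, but no known leader, broadcasts a crafted
    // append reject. Only the actual leader reacts to it, with an
    // empty append request that tells this node who the leader is.
    void fsm::tick() {
        // ... regular tick work ...
        if (is_follower() && !current_leader() && has_pending_requests()) {
            for (server_id node : all_nodes()) {
                send_to(node, make_leader_discovery_reject());
            }
        }
    }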
Fixes #10379
After enabling add_entry forwarding in randomized_nemesis_test, the
test would sometimes hang on the _rpc->abort() call: add_entry
messages from followers were waiting on log_limiter_semaphore on the
leader, preventing _rpc from finishing the abort, and
log_limiter_semaphore would not get unblocked because that part of
the server was already stopped.
Prevent log_limiter_semaphore from being waited on when stopping the
server by becoming a follower in fsm::stop.
This patch adds the ability to pass an abort_source to the raft
request APIs (add_entry, modify_config) to make them abortable. A
request issuer does not always want to wait for a request to
complete, for instance because a client disconnected, or because it
is no longer interested in waiting due to a timeout. After this patch
the issuer can abort the wait for such requests through an abort
source. Note that aborting a request only aborts the wait for it to
complete; it does not mean that the request will not eventually be
executed.
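Usage could look roughly like this (a sketch, assuming Seastar's
abort_source; the exact signatures are defined by the patch):

    seastar::abort_source as;
    seastar::timer<> give_up([&as] { as.request_abort(); });
    give_up.arm(std::chrono::seconds(5));  // stop waiting after 5s
    // Throws if aborted. Only the wait is cancelled; the entry may
    // still be committed and applied eventually.
    co_await server.add_entry(std::move(cmd), raft::wait_type::applied, &as);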
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
When a node starts it does not immediately become a candidate: it
waits to learn about an already existing leader, and it randomizes
the time at which it becomes a candidate to prevent dueling
candidates if several nodes are started simultaneously.
If a cluster consists of only one node, though, there is no point in
waiting before becoming a candidate, because neither of the two cases
above can happen. This patch checks whether the node belongs to a
singleton cluster where the node itself is the only voting member,
and if so becomes a candidate immediately. This reduces the startup
time of single-node clusters, which are often used in testing.
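In sketch form (illustrative names):

    // Skip the election timeout entirely when this node is the only
    // voting member of the configuration.
    bool fsm::is_singleton_cluster() const {
        auto& voters = _log.get_configuration().voters();
        return voters.size() == 1 && voters.contains(_my_id);
    }
    // On startup, instead of arming the election timeout:
    if (is_singleton_cluster()) {
        become_candidate();  // no existing leader, nobody to duel with
    }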
Message-Id: <YiCbQXx8LPlRQssC@scylladb.com>
The Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, the server configuration, and
the ticking speed of servers.
This allowed us to catch a serious bug, which is fixed in the first
patch. The patchset also fixes bugs in the test itself and adds
quality-of-life improvements such as better diagnostics when an
inconsistency is detected.
* kbr/nemesis-random-v1:
test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
test: raft: randomized_nemesis_test: print details when detecting inconsistency
test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
test: raft: randomized_nemesis_test: keep server id in impure_state_machine
test: raft: randomized_nemesis_test: frequent snapshotting configuration
test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
test: raft: randomized_nemesis_test: simplify ticker
test: raft: randomized_nemesis_test: randomize network delay
test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
test: raft: randomized_nemesis_test: fix obsolete comment
raft: fsm: print configuration entries appearing in the log
raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
raft: server: abort snapshot applications before waiting for rpc abort
raft: server: logging fix
raft: fsm: don't advance commit index beyond matched entries
Otherwise it was possible to incorrectly mark obsolete entries from
earlier terms as committed, leading to inconsistencies between state
machine replicas.
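A sketch of the guarded advancement (illustrative names):

    // Advance the commit index only up to an index matched by a
    // quorum, and only if the entry there is from the current term;
    // entries from earlier terms become committed indirectly
    // (Raft paper, section 5.4.2).
    void fsm::maybe_commit() {
        index_t quorum_matched = _tracker.committed(_commit_idx);
        if (quorum_matched <= _commit_idx
                || _log[quorum_matched]->term != _current_term) {
            return;
        }
        _commit_idx = quorum_matched;
    }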
Fixes #9965.
Raft does not need to persist the commit index: a restarted node will
either learn it from an append message from the leader, or (if the
entire cluster is restarted and hence there is no leader) the new
leader will figure it out after contacting a quorum. But some users
may want to be able to bring their local state machine to a state as
up-to-date as it was before the restart, as soon as possible and
without any external communication.
For them this patch introduces a new persistence API that allows
saving and restoring the last seen commit index.
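The new hooks could look roughly like this (a sketch; names are
illustrative):

    // Optional persistence of the commit index, restored on start.
    virtual future<> store_commit_idx(index_t idx) = 0;
    virtual future<index_t> load_commit_idx() = 0;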
Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable,
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
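For example, a dual-licensed source file header becomes (the
copyright line is illustrative):

    /*
     * Copyright (C) 2022-present ScyllaDB
     *
     * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
     */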
Closes #9937
Operations of adding or removing a node in the Raft configuration are
made idempotent: they do nothing if already done, and they are safe
to resume after a failure.
However, since topology changes are not transactional, if a bootstrap
or removal procedure fails midway, the Raft group 0 configuration may
go out of sync with the topology state as seen by gossip.
In the future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state will
be done exclusively through Raft group 0.
Specifically, instead of persisting the tokens by advertising them
through gossip, the bootstrap will commit a change to a system table
using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting the Raft group 0 configuration or
Raft-managed tables.
Once this transformation is done, naturally, adding a node to the
Raft configuration (perhaps as a non-voting member at first) will
become the first persistent change to ring state applied when a node
joins; removing a node from the Raft group 0 configuration will
become the last action when removing a node.
Until this is done, do our best to avoid a cluster state where a
removed node, or a node whose addition failed, is stuck in the Raft
configuration but no longer present in gossip-managed system tables.
In other words, keep gossip the primary source of truth. For this
purpose, carefully choose the timing of when we join and leave Raft
group 0:
Join Raft group 0 only after we've advertised our tokens, so that the
cluster is aware of this node and it is visible in nodetool status,
but before the node's state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from group 0 *before* its tokens are removed from
gossip-managed system tables. This guarantees that if removal from
Raft group 0 fails for whatever reason, the node stays in the ring,
so nodetool removenode and friends can be re-tried.
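In sketch form, the resulting ordering (illustrative names):

    // Bootstrap / restart:
    advertise_tokens_via_gossip(); // node shows up in nodetool status
    co_await join_group0();        // idempotent, run on every (re)start
    become_normal();               // only now start accepting queries
    // Removal:
    co_await leave_group0();       // on failure the node stays in the ring
    remove_tokens_from_gossip();   // gossip remains the source of truth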
Add tracing.
For leader stepdown purposes a non-voting member is no different from
a node outside of the config. The patch makes the relevant code paths
check for both conditions.
If a node is a non-voting member it cannot be a leader, so the stable
leader rule should not be applied to it. This patch aligns non-voting
node behaviour with that of a node that was removed from the cluster:
both step down from the leader position if they happen to be the
leader when the state change occurs.
The code assumes that a snapshot that was taken locally is never
applied. Currently the logic to detect that is flawed: it relies on
the id of the most recently applied snapshot (where a locally taken
snapshot is considered to be applied as well). But if another local
snapshot is taken between snapshot creation and the check, the ids
will not match.
The patch fixes this by propagating locality information together
with the snapshot itself.
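In sketch form, the idea is to carry the flag rather than compare
ids:

    // Sketch: locality travels with the snapshot through the apply
    // path instead of being reconstructed from snapshot ids.
    struct snapshot_to_apply {
        snapshot_descriptor snp;
        bool is_local; // taken by this node, not received from a leader
    };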
Since io_fiber persists entries before sending out messages, even
non-stable entries become stable before being observed by other
nodes.
This patch also moves generation of append messages into the
get_output() call, because without the change we would lose batching:
each advance of last_idx would generate a new append message.
This patch implements a Raft extension that allows performing
linearizable reads by accessing the local state machine. The
extension is described in section 6.4 of the Raft PhD dissertation.
To sum it up: to perform a read barrier, a follower needs to ask the
leader for the last committed index that it knows about. The leader
must make sure that it is still the leader before answering, by
communicating with a quorum. When the follower gets the index back,
it waits for it to be applied, and with that completes the
read_barrier invocation.
The patch adds three new RPCs: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is what a follower uses
to ask the leader for a safe index it can read at. The first two are
used by the leader to communicate with a quorum.
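Schematically, the follower side (a sketch with illustrative names):

    future<> fsm::read_barrier() {
        // Ask the leader for a safe index. Before answering, the
        // leader confirms it is still the leader via a read_barrier /
        // read_barrier_reply round with a quorum.
        index_t idx = co_await execute_read_barrier_on_leader(current_leader());
        // Once the local state machine has applied up to idx, reads
        // from it are linearizable with respect to the barrier.
        co_await wait_for_apply(idx);
    }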
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.
Debug mode is too slow for a test of 1000 nodes, so the test is
disabled there, but it passes in release and dev modes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
All entries up to snapshot.idx must obviously be committed, so why not
update _commit_idx to reflect that.
With this we get a useful invariant:
`_log.get_snapshot().idx <= _commit_idx`.
For example, when checking whether the latest active configuration is
committed, it should be enough to compare the configuration index to the
commit index. Without the invariant we would need a special case if the
latest configuration comes from a snapshot.
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
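With the invariant, the guard becomes a single comparison (sketch):

    // Applying a snapshot moves the state machine to snp.idx;
    // accepting snp.idx < _commit_idx would replay commands out of
    // order.
    bool fsm::can_apply_snapshot(const snapshot_descriptor& snp) const {
        return snp.idx >= _commit_idx;
    }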
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
append_reply packets can be reordered, so reply.commit_idx may be
smaller than the one in the tracker. The tracker's commit index is
used to check whether a follower needs to be updated with a
potentially empty append message, so the bug may theoretically cause
unneeded packets to be sent.
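The fix keeps the tracker's value monotonic (sketch):

    // append_reply packets may be reordered; never let a stale reply
    // lower the commit index recorded for this follower.
    progress.commit_idx = std::max(progress.commit_idx, reply.commit_idx);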
Message-Id: <YQZZ/6nlNb5nQyXp@scylladb.com>
When a leader moves to the follower state it aborts all requests that
are waiting on an admission semaphore with a not_a_leader exception.
But currently it specifies itself as the new leader, since the abort
happens before the fsm state changes to follower. The patch fixes
this by destroying the leader state after the fsm state has already
changed to follower.
Message-Id: <YPbI++0z5ZPV9pKb@scylladb.com>
Sometimes the ability to force a leader change is needed, for
instance when a node that is currently serving as the leader needs to
be brought down for maintenance. If it is shut down without a
leadership transfer, the cluster will be unavailable for at least the
leader election timeout.
We already have a mechanism to transfer the leadership when an active
leader is removed. The patch exposes it as an external interface with
a timeout period. If the node is still the leader after the timeout,
the operation fails.
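Usage could look like this (a sketch; the name and signature are
illustrative):

    // Ask the current leader to hand off leadership before
    // maintenance; fails if this node is still the leader after 10s.
    co_await server.transfer_leadership(std::chrono::seconds{10});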
If the leader becomes a non-voter after a configuration change,
step down and become a follower.
Non-voting members are an extension to Raft, so the protocol spec
does not define whether they can be leaders. I cannot think of a
reason why they can't, yet I also cannot think of a reason why it
would be useful, so let's forbid this.
We already do not allow non-voters to become candidates, and they
ignore the timeout_now RPC (leadership transfer), so they already
cannot be elected.
Isolate the checks for configuration transitions in a static
function, to be able to unit test them outside of class server.
Split the condition of transitioning to an empty configuration from
the condition of transitioning into a configuration with no voters,
to produce more user-friendly error messages.
*Allow* transferring leadership in a configuration where the only
voter is the leader itself. This is equivalent to syncing the
leader's log with the learner and then converting the leader itself
to a follower. It is safe, since the leader will re-elect itself
quickly after an election timeout, and it may be used to do a rolling
restart of a cluster with only one voter.
A test case follows.
The tracker maintains a separate pointer to the current leader's
progress, but all this complexity is not needed, because the tracker
already has a find() function that can either find the leader's
progress by id or return null. Removing the tracking simplifies the
code and makes going out of sync (which is always a possibility when
state is maintained in two different places) impossible.
current_leader has meaning only when the fsm is in the follower
state: in the leader state a node is always its own leader, and in
the candidate state there is no leader. To make sure that the
current_leader value cannot be out of sync with the fsm state, move
it into the follower state.
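In sketch form (illustrative):

    // current_leader exists only inside the follower state, so it
    // cannot disagree with the rest of the fsm.
    struct follower  { server_id current_leader; };
    struct candidate { /* votes */ };
    struct leader    { /* tracker */ };
    std::variant<follower, candidate, leader> _state;
    server_id fsm::current_leader() const {
        if (std::holds_alternative<leader>(_state)) {
            return _my_id;            // this node is the leader itself
        }
        if (auto* f = std::get_if<follower>(&_state)) {
            return f->current_leader; // may still be unknown ({})
        }
        return {};                    // candidate: no leader
    }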
When probes are sent over a slow network, the leader may send
multiple probes to a lagging follower before getting back a reject
response to the first probe. After getting a reject, the leader is
able to correctly position `next_idx` for that follower and switch to
pipeline mode. Then, an out-of-order reject to a now-irrelevant probe
could crash the leader, since it would effectively request it to
"rewind" its `match_idx` for that follower, and the code asserts this
never happens.
We fix the problem by strengthening `is_stray_reject`. The check that
was previously only made in `PIPELINE` case
(`rejected.non_matching_idx <= match_idx`) is now always performed and
we add a new check: `rejected.last_idx < match_idx`. We also strengthen
the assert.
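In sketch form (edge cases omitted):

    // A reject is stray, i.e. may be safely ignored, if it
    // contradicts what we already know the follower has accepted:
    // its log matches ours up to match_idx, so a reject claiming a
    // shorter log or a mismatch at or below match_idx must come from
    // an obsolete probe.
    bool follower_progress::is_stray_reject(const append_reply::rejected& r) const {
        return r.last_idx < match_idx || r.non_matching_idx <= match_idx;
    }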
The commit improves the documentation by explaining that
`is_stray_reject` may return false negatives. We also precisely state
the preconditions and postconditions of `is_stray_reject`, give a more
precise definition of `progress.match_idx`, argue how the
postconditions of `is_stray_reject` follow from its preconditions
and Raft invariants, and argue why the (strengthened) assert
must always pass.
Message-Id: <20210423173117.32939-1-kbraun@scylladb.com>
Otherwise waiters on committed configuration changes (e.g.
`server::set_configuration`) would never get notified.
Also if we tried to send another entry concurrently we would get
replication_test: raft/server.cc:318: void raft::server_impl::notify_waiters(std::map<index_t, op_status> &, const std::vector<log_entry_ptr> &): Assertion `entry_idx >= first_idx' failed.
(not sure if this commit also fixes whatever caused that).
Message-Id: <20210419181319.68628-2-kbraun@scylladb.com>
The current leader code checks that most nodes are alive, but this is
incorrect, since some nodes may be non-voting and hence should not
cause a leader to step down if dead. It is also incorrect with a
joint config, since the quorum is calculated differently there. Fix
it by introducing an activity_tracker class that knows how to handle
all the above details.
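A sketch of the quorum accounting (illustrative names):

    // Count only voters; in a joint configuration require an alive
    // quorum in both the old and the new voter sets.
    static bool quorum_alive(const std::unordered_set<server_id>& voters,
                             const std::unordered_set<server_id>& alive) {
        size_t n = std::count_if(voters.begin(), voters.end(),
                [&] (const server_id& id) { return alive.contains(id); });
        return 2 * n > voters.size();
    }
    bool leader_has_quorum(const configuration& cfg,
                           const std::unordered_set<server_id>& alive) {
        return quorum_alive(cfg.current_voters(), alive)
            && (!cfg.is_joint() || quorum_alive(cfg.previous_voters(), alive));
    }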
Usually initiating the stepdown process does not immediately depose
the current leader, but if the current leader is no longer part of
the cluster, it does. We were missing this check after initiating the
stepdown process in append reply handling.
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.
Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`; they are used in the condition that
evaluates the current rpc configuration in `fsm.get_output()`.
This will be used later to update the rpc config state stored in
`server_impl` and maintain the rpc address map.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Add comments explaining the rationale behind transfer_leadership()
(more PhD quotes), encapsulate the stable leader check in tick() into
a lambda, and add more detailed comments to it.