For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.
Fixes#10379
The `wait_for_leader` function would throw a low-level
`abort_requested_aborted` exception from seastar::shared_promise.
Translate it to the high-level raft::request_aborted so we can reduce
the number of different exception types which cross the Raft API
boundary.
Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
This patch adds an ability to pass abort_source to raft request APIs (
add_entry, modify_config) to make them abortable. A request issuer not
always want to wait for a request to complete. For instance because a
client disconnected or because it no longer interested in waiting
because of a timeout. After this patch it can now abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete, it does not mean that the request
will not be eventually executed.
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.
There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.
`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.
We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.
Fixes#9981.
A regression test was also added in randomized_nemesis_test.
* kbr/modify-config-finishes-v1:
test: raft: randomized_nemesis_test: regression test for #9981
raft: server: don't create local waiter in `modify_config`
Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration,
ticking speed of servers.
This allowed to catch a serious bug, which is fixed in the first patch.
The patchset also fixes bugs in the test itself and adds quality of life
improvements such as better diagnostics when inconsistency is detected.
* kbr/nemesis-random-v1:
test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
test: raft: randomized_nemesis_test: print details when detecting inconsistency
test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
test: raft: randomized_nemesis_test: keep server id in impure_state_machine
test: raft: randomized_nemesis_test: frequent snapshotting configuration
test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
test: raft: randomized_nemesis_test: simplify ticker
test: raft: randomized_nemesis_test: randomize network delay
test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
test: raft: randomized_nemesis_test: fix obsolete comment
raft: fsm: print configuration entries appearing in the log
raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
raft: server: abort snapshot applications before waiting for rpc abort
raft: server: logging fix
raft: fsm: don't advance commit index beyond matched entries
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.
There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.
`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.
We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.
Fixes#9981.
The implementation of `rpc` may wait for all snapshot applications to
finish before it can finish aborting. This is what the
randomized_nemesis_test implementation did. This caused rpc abort to
hang in some scenarios.
In this commit, the order of abort calls is modified a bit. Instead of
waiting for rpc abort to finish and then aborting existing snapshot
applications, we call `rpc::abort()` and keep the future, then abort
snapshot applications, then wait on the future. Calling `rpc::abort()`
first is supposed to prevent new snapshot applications from starting;
a comment was added at the interface definition. The nemesis test
implementation had this property, and `raft_rpc` in group registry
was adjusted appropriately. Aborting the snapshot applications then
allows `rpc::abort()` to finish.
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if entire cluster
is restarted and hence there is no leader) new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before restart
as soon as possible without any external communication.
For them this patch introduces new persistence API that allows saving
and restoring last seen committed index.
Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
Abort signals stopped_error on all awaited entries, but if an entry is
added after this it will be destroyed without signaling and will cause
a waiter to get broken_promise.
Fixes#9688
Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so ambiguous,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
the measuring the work on the last tick to be inaccurate (you need
thousands of ticks to get any amount of confidence in the
measurement) that in the end it doesn't matter what we pick.
Tests: unit (dev)
Closes#9904
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).
Metrics are global so `register_metrics()` cannot be called on
two servers that have the same ID, which is useful e.g. in tests when we
want to simulate server stops and restarts.
In absence of abort_source or timeouts in Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.
Optionally disable forwarding in basic_generator_test.
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.
The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.
When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.
This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.
Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.
Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.
Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
Completion notification code assumes that previous snapshot is applied
before new entries are committed, otherwise it asserts that some
notifications were missing. But currently commit notifications and
snapshot application run in different fibers, so the can be race between
those.
Fix that by moving commit notification into applier fiber as well.
Fixes#9550
When remote snapshot is applied the log is completely cleared because
snapshot transfer happens only when common log prefix cannot be found,
so we cannot be sure that existing entries in the log are correct. But
currently it only happens for in memory log by calling apply_snapshot
with trailing set to zero, but when persistence module is called to
store the snapshot _config.snapshot_trailing is used which can be non
zero. This may cause the log to contain incorrect entries after restart.
The patch fixes this by using zero trailing for non local snapshots.
Fixes#9551
If `rpc::send_snapshot` returned immediately with a ready future, or if
it threw, the code in `server_impl::send_snapshot` would not update
`_snapshot_transfers` correctly.
The code assumed that the continuation attached to `rpc::send_snapshot`
(with `then_wrapped`) was executed after `_snapshot_transfer` below the
`rpc::send_snapshot` call was updated. That would not necessarily be true
(the continuation may even not have been executed at all if
`rpc::send_snapshot` threw).
Fix that by wrapping the `rpc::send_snapshot` call into a continuation
attached to `later()`.
Originally authored by Gleb <gleb@scylladb.com>, I added a comment.
If a locally taken snapshot cannot be installed because newer one was
received meanwhile it should be dropped, otherwise it will take space
needlessly.
Message-Id: <YUrWXxVfBjEio1Ol@scylladb.com>
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
The code assumes that the snapshot that was taken locally is never
applied. Currently logic to detect that is flawed. It relies on an
id of a most recently applied snapshot (where a locally taken snapshot
is considered to be applied as well). But if between snapshot creation
and the check another local snapshot is taken ids will not match.
The patch fixes this by propagating locality information together with
the snapshot itself.
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.
The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
Add a helper to be able to wait until a Raft cluster
leader is elected. It can be used to avoid sleeps
when it's necessary to forward a request to the leader,
but the leader is yet unknown.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
We want to serialize snapshot application with command application
otherwise a command may be applied after a snapshot that already contains
the result of its application (it is not necessary a problem since the
raft by itself does not guaranty apply-once semantics, but better to
prevent it when possible). This also moves all interactions with user's
state machine into one place.
Message-Id: <YPltCmBAGUQnpW7r@scylladb.com>
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it will be shutdown without leadership
transfer the cluster will be unavailable for leader election timeout at
least.
We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If a node is still a leader after the timeout the
operation will fail.
By default, wait for the server to leave the joint configuration
when making a configuration change.
When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().
Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.
So let's be nice and wait for a full transition in
server::set_configuration().
Isolate the checks for configuration transitions in a static function,
to be able to unit test outside class server.
Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.
*Allow* to transfer leadership in a configuration when
the only voter is the leader itself. This would be equivalent
to syncing the leader log with the learner and converting
the leader to the follower itself. This is safe, since
the leader will re-elect itself quickly after an election timeout,
and may be used to do a rolling restart of a cluster with
only one voter.
A test case follows.