The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.
An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.
In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.
Explicit comments have also been added to raft::server methods
describing all possible exceptions.
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.
snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
applier_fiber could create multiple snapshots between
io_fiber run. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes#11530
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
Closes#11318
* github.com:scylladb/scylladb:
raft, use max_command_size to satisfy commitlog limit
raft, limit for command size
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes#11288.
Closes#11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes#11235.
Closes#11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` in
resolving the promise exceptionally.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes#11288.
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and proceed by any remaining entry dropping waiters (if
the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Fixes#11235.
In `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
Dtest fails if it sees an unknown errors in the logs. This series
reduces severity of some errors (since they are actually expected during
shutdown) and removes some others that duplicate already existing errors
that dtest knows how to deal with. Also fix one case of unhandled
exception in schema management code.
* 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla:
raft: getting abort_requested_exception exception from a sm::apply is not a critical error
schema_registry: fix abandoned feature warning
service: raft: silence rpc::closed_errors in raft_rpc
During shutdown it is normal to get abort_requested_exception exception
from a state machine "apply" method. Do not rethrow it as
state_machine_error, just abort an applier loop with an info message.
If the leader was unavailable during read_barrier,
closed_error occurs, which was not handled in any way
and eventually reached the client. This patch adds retries in this case.
Fix: scylladb#11262
Refs: #11278Closes#11263
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).
Closes#11047
* github.com:scylladb/scylla:
raft: fsm: fix `entry_size` calculation for config entries
raft: split `can_vote` field from `server_address` to separate struct
serializer_impl: generalize (de)serialization of `unordered_set`
to_string: generalize `operator<<` for `unordered_set`
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
transparent hash and comparator functions, which allow invoking
`contains` and similar functions by providing a different key type (in
this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
helper functions in the test helpers module and use them.
Leader which ceases to be a leader as a result of a
execute_modify_config cannot wait for a dummy record to be
committed because io_fiber aborts current waiters as soon as it
detects a lost of leadership.
This commit excludes dummy entries from the configuration change
procedure. A special promise is set on io_fiber when it gets a
non-joint configuration, and set_configuration just waits for
the corresponding future instead of a dummy record.
Fixes: #10010Closes#10905
Previously it could happen that `add_entry` returned successfully but
`state_machine::apply` was never called by the server for this entry,
even though `wait_type::applied` was used, if the server loaded
a snapshot that contained this entry in just the right moment. Some
clients may find this behavior surprising, even though we may argue that
it's not technically incorrect.
For example, the nemesis test assumed that if `add_entry` returned
successfully (with `wait_type::applied`), the local state machine
applied the entry; the test uses `apply` to obtain an output - the
result of the command - from the state machine.
It's not a problem to give a stronger guarantee, so we do it in this
commit. In the scenario where a snapshot causes Raft to skip over the
entry, `add_entry` will finish exceptionally with
`commit_status_unknown`.
`rpc::abort` may need to wait until all read barriers finish, so abort
read barrier before waiting for `rpc::abort` to finish to avoid a
deadlock on shutdown.
`rpc::abort` is still called before the read barriers are aborted, only
waited for after. Calling it first prevents new read barriers from being
started by `rpc` (see `rpc::abort` comment).
Also prevent new read barriers from being started after abort starts
directly on a leader by checking the `_aborted` flag at the beginning
of `execute_read_barrier`.
Finally, use the opportunity to remove some compiler-dependent code.
The functions are called from RPC when a follower forwards a request to
a leader (`add_entry`, `modify_config`, `read_barrier`). The call may be
attempted during shutdown. The Raft shutdown code cleans up data structures
created by those requests. Make sure that they are not updated
concurrently with shutdown. This can lead to problems such as using the
server object after it was aborted, or even destroyed.
After this change, the RPC implementation may wait for a `execute_modify_config`
call to finish before finishing abort. That call in turn may be stuck on
`wait_for_entry`. Thus the waiter may prevent RPC from aborting. Fix
this be moving the wait on the future returned from `_rpc->abort()` in
`server::abort()` until after waiters were destroyed.
`modify_config` would call `execute_modify_config` or
`_rpc->send_modify_config`, which returned a reply of type
`add_entry_reply`. This is a variant of 3 options: `entry_id`,
`not_a_leader`, or `commit_status_unknown`. The code would check
for the `entry_id` option and otherwise assume that it was `not_a_leader`.
During nemesis testing however, the reply was sometimes
`commit_status_unknown`, which caused a `bad_variant_access` exception
during `std::get` call. Fix this.
There is a similar piece of code in `add_entry`, but there it should be
impossible to obtain `commit_status_unknown` even though the types don't
enforce it. Make it more explicit with a comment and an assertion.
For a follower to forward requests to a leader the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
case the leader will send nothing to the follower and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to the raft's log a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting specially crafted append reject to all
nodes in the cluster on a tick in case a leader is not known. The leader
responds to this message with an empty append request which will cause the
node to learn about the leader. For optimisation purposes the patch
sends the broadcast only in case there is actually an operation that
waits for leader to be known.
Fixes#10379
The `wait_for_leader` function would throw a low-level
`abort_requested_aborted` exception from seastar::shared_promise.
Translate it to the high-level raft::request_aborted so we can reduce
the number of different exception types which cross the Raft API
boundary.
Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
This patch adds an ability to pass abort_source to raft request APIs (
add_entry, modify_config) to make them abortable. A request issuer not
always want to wait for a request to complete. For instance because a
client disconnected or because it no longer interested in waiting
because of a timeout. After this patch it can now abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete, it does not mean that the request
will not be eventually executed.
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.
There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.
`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.
We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.
Fixes#9981.
A regression test was also added in randomized_nemesis_test.
* kbr/modify-config-finishes-v1:
test: raft: randomized_nemesis_test: regression test for #9981
raft: server: don't create local waiter in `modify_config`
Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration,
ticking speed of servers.
This allowed to catch a serious bug, which is fixed in the first patch.
The patchset also fixes bugs in the test itself and adds quality of life
improvements such as better diagnostics when inconsistency is detected.
* kbr/nemesis-random-v1:
test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
test: raft: randomized_nemesis_test: print details when detecting inconsistency
test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
test: raft: randomized_nemesis_test: keep server id in impure_state_machine
test: raft: randomized_nemesis_test: frequent snapshotting configuration
test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
test: raft: randomized_nemesis_test: simplify ticker
test: raft: randomized_nemesis_test: randomize network delay
test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
test: raft: randomized_nemesis_test: fix obsolete comment
raft: fsm: print configuration entries appearing in the log
raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
raft: server: abort snapshot applications before waiting for rpc abort
raft: server: logging fix
raft: fsm: don't advance commit index beyond matched entries
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.
There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.
`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.
We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.
Fixes#9981.
The implementation of `rpc` may wait for all snapshot applications to
finish before it can finish aborting. This is what the
randomized_nemesis_test implementation did. This caused rpc abort to
hang in some scenarios.
In this commit, the order of abort calls is modified a bit. Instead of
waiting for rpc abort to finish and then aborting existing snapshot
applications, we call `rpc::abort()` and keep the future, then abort
snapshot applications, then wait on the future. Calling `rpc::abort()`
first is supposed to prevent new snapshot applications from starting;
a comment was added at the interface definition. The nemesis test
implementation had this property, and `raft_rpc` in group registry
was adjusted appropriately. Aborting the snapshot applications then
allows `rpc::abort()` to finish.
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if entire cluster
is restarted and hence there is no leader) new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before restart
as soon as possible without any external communication.
For them this patch introduces new persistence API that allows saving
and restoring last seen committed index.
Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
Abort signals stopped_error on all awaited entries, but if an entry is
added after this it will be destroyed without signaling and will cause
a waiter to get broken_promise.
Fixes#9688
Message-Id: <Ye6xJjTDooKSuZ87@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so ambiguous,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
the measuring the work on the last tick to be inaccurate (you need
thousands of ticks to get any amount of confidence in the
measurement) that in the end it doesn't matter what we pick.
Tests: unit (dev)
Closes#9904
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).
Metrics are global so `register_metrics()` cannot be called on
two servers that have the same ID, which is useful e.g. in tests when we
want to simulate server stops and restarts.
In absence of abort_source or timeouts in Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.
Optionally disable forwarding in basic_generator_test.
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.
The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.
When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.
This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.
Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.
Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.
Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.