A series of refactors to the `raft_group0` service.
Read the commits in topological order for the best experience.
This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through.
Closes #11024
* github.com:scylladb/scylla:
service: storage_service: additional assertions and comments
service/raft: raft_group0: additional logging, assertions, comments
service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0`
service/raft: raft_group0: rewrite `remove_from_group0`
service/raft: raft_group0: rewrite `leave_group0`
service/raft: raft_group0: split `leave_group0` from `remove_from_group0`
service/raft: raft_group0: introduce `setup_group0`
service/raft: raft_group0: introduce `load_my_addr`
service/raft: raft_group0: make some calls abortable
service/raft: raft_group0: remove some temporary variables
service/raft: raft_group0: refactor `do_discover_group0`.
service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`
service/raft: raft_group0: extract `start_server_for_group0` function
service/raft: raft_group0: create a private section
service/raft: discovery: `seeds` may contain `self`
We could yield between updating the list of servers in raft/fsm
and updating the raft_address_map, e.g. in case of a set_configuration.
If tick_leader happens before the raft_address_map is updated,
is_alive will be called with server_id that is not in the map yet.
Fix: scylladb/scylla-dtest#2753
Closes #11111
An expiring entry is added when a message is received from an unknown
host. If the host is later added to the Raft configuration, the entry
becomes non-expiring. After that it can only be removed when the host is
dropped from the configuration; it should never become expiring again.
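The entry life cycle above can be sketched as a tiny state machine. This is an illustrative sketch only (class and method names are invented, and the key type is simplified to an integer id), not the actual address-map implementation:

```cpp
#include <cassert>
#include <unordered_map>

// Sketch of the entry policy described above: entries start as expiring
// and can only ever be promoted to permanent, never demoted.
enum class entry_kind { expiring, permanent };

class address_map_sketch {
    std::unordered_map<int, entry_kind> _entries; // key: simplified server id
public:
    // A message from an unknown host adds an expiring entry.
    // emplace is a no-op if an entry already exists, so this
    // never demotes a permanent entry back to expiring.
    void on_message(int id) {
        _entries.emplace(id, entry_kind::expiring);
    }
    // Adding the host to the Raft configuration promotes the entry.
    void on_added_to_config(int id) {
        _entries[id] = entry_kind::permanent;
    }
    // Dropping the host from the configuration is the only way
    // to remove a permanent entry.
    void on_removed_from_config(int id) {
        _entries.erase(id);
    }
    bool contains(int id) const { return _entries.count(id) != 0; }
    bool is_permanent(int id) const {
        auto it = _entries.find(id);
        return it != _entries.end() && it->second == entry_kind::permanent;
    }
};
```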
Refs #10826
Move some rare logs from TRACE to INFO level.
Add some assertions.
Write some more comments, including FIXMEs and TODOs.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
Group 0 discovery would internally fetch the seed list from the gossiper.
The gossiper would return the seed list from conf/scylla.yaml. This seed
list is appropriate for the bootstrapping scenario - we specify the
initial contact points for a node that joins a cluster.
We'll have to use a different list of seeds for group 0 discovery for
the upgrade scenario. Prepare for that by taking the seed list as a
parameter.
In the bootstrap scenario we'll pass the seed list down from
`storage_service::join_cluster`.
Additionally, `join_group0` now takes an `as_voter` flag, which is
`false` in the bootstrap scenario (we initially join as a non-voter) but
will be `true` in the upgrade scenario.
See the previous commit. `remove_from_group0` had a problem similar to
`leave_group0`'s: it would handle the case where the
`raft_group0::_group0` variant was not `raft::group_id` (i.e. we haven't
joined group 0) but the RAFT local feature was enabled - i.e. the
yet-unimplemented upgrade case - by running discovery and calling
`send_group0_modify_config`.
Instead, if we see that we've joined group 0 before, assume that we're
still a member and simply use the Raft `modify_config` API to remove
another server. If we're not a member it means we either decommissioned
or were removed by someone else; then we have no business trying to
remove others. There's also the unimplemented upgrade case but that will
come in another pull request.
Finally, add some logic for handling an edge case: suppose we joined
group 0 recently and haven't yet fully updated our RPC address map
(it's updated asynchronously by Raft's io_fiber). Thus we may fail
to find a member of group 0 in the address map. To handle this, ensure
we're up to date by performing a Raft read barrier.
State some assumptions in a comment.
Add a TODO for handling failures.
Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
group 0. Then `raft_group0::_group0` variant holds the
`raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
group 0. This means the RAFT local feature was disabled when we
bootstrapped and we're in the (unimplemented yet) upgrade scenario.
`raft_group0::_group0` variant holds the `std::monostate` alternative.
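The three cases above hinge on which alternative the `_group0` variant holds. A minimal sketch of that check (with `raft::group_id` simplified to an integer type; `joined_group0` mirrors the helper named later in this series):

```cpp
#include <cassert>
#include <variant>

// Simplified model of the `raft_group0::_group0` variant logic:
// `std::monostate` means we never joined group 0 (cases 1 and 3),
// a held group id means we joined during bootstrap (case 2).
using group_id = int; // stands in for raft::group_id
using group0_state = std::variant<std::monostate, group_id>;

bool joined_group0(const group0_state& g) {
    return std::holds_alternative<group_id>(g);
}
```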
The problem with the previous implementation was that it checked for the
conditions of the third case above - that RAFT local feature is enabled
but `_group0` does not hold `raft::group_id` - and if those conditions
were true, it executed some logic that didn't really make sense: it ran
the discovery algorithm and called `send_group0_modify_config` RPC.
In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure.
- we're being run during decommission - after the node entered LEFT
status.
In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.
Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.
Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
`leave_group0` was responsible both for removing a different node from
group 0 and for removing ourselves from (leaving) group 0. The two
scenarios are a bit different, and their handling will be rewritten in
the following commits.
Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency: saying that the procedure is idempotent is an
oversimplification; one could argue it's incorrect, since a second call
simply hangs, at least when leaving group 0. Following commits will
state what happens more precisely.
Add some additional logging and assertions where the two functions are
called in `storage_service`.
It contains all the logic for deciding whether or not to join group 0.
Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).
Move the group 0 setup step earlier in `storage_service::join_cluster`.
`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
Compared to `load_or_create_my_addr` this function assumes that
the address is already present on disk; if not, it's a fatal error.
Use it in places where it would indeed be a fatal error
if the address was missing.
There are some calls to `modify_config` which should react to aborts
(e.g. when we shut down Scylla).
There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.
The function no longer accesses the `_group0` variant directly;
instead, it is made a member of `service::persistent_discovery`. The
caller guarantees that `persistent_discovery` is not destroyed before
the function finishes.
The function is now named `run`. A short comment was written at the
declaration site.
Make some members of `persistent_discovery` private, as they are only
used by `run`.
Simplify `struct tracker`, store the discovery output separately
(`struct tracker` is now responsible for a single thing).
Enclose the `parallel_for_each` over requests in a common coroutine
which keeps alive all the necessary things for the loop body and
performs the last step which was previously inside a `then`.
The set of seeds passed to the discovery algorithm may contain `self`.
The implementation will filter the `self` out (it calls `step(seeds)`;
`step` iterates over the given list of peers and ignores `_self`).
Specify this at the `discovery` constructor declaration site.
Simplify the code constructing `persistent_discovery` in
`raft_group0::discover_group0` using this assumption.
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
transparent hash and comparator functions, which allow invoking
`contains` and similar functions by providing a different key type (in
this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
helper functions in the test helpers module and use them.
It did nothing.
It will be re-added in `raft_group0`, and this time it will do
something; stay tuned.
With this we can remove the `feature_service` reference from
`raft_group_registry`.
Commit e739f2b779 ("cql3: expr: make evaluate() return a
cql3::raw_value rather than an expr::constant") introduced
raw_value::view() as a synonym to raw_value::to_view() to reduce
churn. To fix this duplication, we now remove raw_value::to_view().
raw_value::to_view() was picked for removal because it has fewer
call sites, reducing churn again.
Closes #10819
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.
Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.
Other headers were adjusted as needed so everything (including
`ninja dev-headers`) still compiles.
Closes #10755
`raft_group0` does not own the source and is not responsible for calling
`request_abort`. The source comes from top-level `stop_signal` (see
main.cc) and that's where it's aborted.
Fixes#10668.
Closes #10678
The functions are called from RPC when a follower forwards a request to
a leader (`add_entry`, `modify_config`, `read_barrier`). The call may be
attempted during shutdown. The Raft shutdown code cleans up data
structures created by those requests; make sure they are not updated
concurrently with shutdown. Otherwise we could run into problems such as
using the server object after it was aborted, or even after it was
destroyed.
After this change, the RPC implementation may wait for an
`execute_modify_config` call to finish before finishing abort. That call
in turn may be stuck on `wait_for_entry`, so the waiter may prevent RPC
from aborting. Fix this by moving the wait on the future returned from
`_rpc->abort()` in `server::abort()` until after the waiters are
destroyed.
Writing into the group 0 Raft group on the client side involves locking
the state machine, choosing a state id and checking for its presence
after the operation completes. The code that does this currently resides
in the migration manager, since it is the only user of group 0. In the
near future we will have more clients of group 0, and they will all need
the same logic, so this patch moves it to a separate class,
`raft_group0_client`, which any future user of group 0 can use to write
into it.
Message-Id: <YoYAJwdTdbX+iCUn@scylladb.com>
On each shard, we register a listener for the new direct failure detector service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.
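The listener logic described above is essentially a set with three operations. A minimal sketch (names invented, the id type simplified; the real code also handles the endpoint/address translation mentioned below):

```cpp
#include <cassert>
#include <unordered_set>

// Sketch of the per-shard listener: mark_alive/mark_dead maintain
// a set of live servers, and is_alive() is just a membership check.
using server_id = unsigned long;

class liveness_listener {
    std::unordered_set<server_id> _alive;
public:
    void mark_alive(server_id id) { _alive.insert(id); }
    void mark_dead(server_id id) { _alive.erase(id); }
    bool is_alive(server_id id) const { return _alive.count(id) != 0; }
};
```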
There is some complexity in between, because we need to translate
direct_failure_detector endpoint_ids to inet_addresses and raft::server_ids
to inet_addresses, but all the building blocks are already there.
We connect the group 0 Raft server RPC implementation to the new direct
failure detector service, so that when servers are added to or removed
from the group 0 configuration, the corresponding endpoints are added to
or removed from the direct failure detector service. Thus the set of
detected endpoints will be equal to the group 0 configuration.
This causes the failure detector service to start pinging endpoints,
but no listeners are registered yet. The following commit changes that.
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.
To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).
References throughout the codebase are adjusted.
When executing internal queries, it is important that the developer
decides whether to cache the query internally, since internal queries
are cached indefinitely. It is also important that the programmer is
aware of whether caching is going to happen.
The code contained two "groups" of `query_processor::execute_internal`
overloads: one group caches by default and the other doesn't.
Here we add overloads that eliminate the default values for the caching
behaviour, forcing an explicit parameter for caching.
All call sites were changed to preserve the original caching default
that was there.
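The idea of replacing a defaulted parameter with an explicit one can be sketched like this. The `cache_internal` enum and `execute_internal` signature here are illustrative stand-ins, not the exact Scylla declarations:

```cpp
#include <cassert>
#include <string>

// Explicit enum instead of a defaulted bool: every call site must now
// spell out the caching behaviour, so indefinite caching is never
// picked up silently.
enum class cache_internal { no, yes };

// Before: execute_internal(query, bool cache = true) - callers could
// omit the argument and get caching without realizing it.
// After: no default, the caller must choose.
std::string execute_internal(const std::string& query, cache_internal cache) {
    // Simplified body: just record which behaviour was requested.
    return query + (cache == cache_internal::yes ? " [cached]" : " [uncached]");
}
```

A call like `execute_internal("SELECT ...")` no longer compiles, which is exactly the point: the choice becomes visible at every call site.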
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The code would advertise the USES_RAFT feature when the SUPPORTS_RAFT
feature was enabled through a listener registered on the SUPPORTS_RAFT
feature.
This would cause a deadlock:
1. `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`
locks the gossiper (it's called for the first time from sstables
format selector).
2. The function calls `on_change` listeners.
3. One of the listeners is the one for SUPPORTS_RAFT.
4. The listener calls
`gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`.
5. This tries to lock the gossiper.
In turn, depending on timing, this could hang the startup procedure,
which calls `add_local_application_state` multiple times at various
points, trying to take the lock inside gossiper.
This prevents us from testing raft / group 0, new schema change
procedures that use group 0, etc.
For now, simply remove the code that advertises the USES_RAFT feature.
Right now the feature has no other effect on the system than just
becoming enabled. In fact, it's possible that we don't need this second
feature at all (SUPPORTS_RAFT may be enough), but that's
work-in-progress. If needed, it will be easy to bring the enabling code
back (in a fixed form that doesn't cause a deadlock). We don't remove
the feature definitions yet just in case.
Refs: #10355
Since `gossiper::add_local_application_state` is not
safe to call concurrently from multiple shards (which
will cause a deadlock inside the method), call this
only on shard 0 in `_raft_support_listener`.
This fixes sporadic hangs when starting a fresh node in an
empty cluster where node hangs during startup.
Tests: unit(dev), manual
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is a listener in the `raft_group_registry`
which makes the gossiper re-publish the supported
features app state to the cluster.
We don't need to do this in case `USES_RAFT_CLUSTER_MANAGEMENT`
feature is enabled before the usual time, i.e. before the
gossiper settles. So, short-circuit the listener logic in
that case and do nothing.
Also, don't do anything if the raft group registry is not enabled
at all; this is just a generic safeguard.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
raft feature.
This small series consists of 3 parts to fix the handling of supported
features for raft:
1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
`raft_group_registry`.
2. Update `system.local#supported_features` directly in the
`feature_service::support()` method.
3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
value in the support callback within `raft_group_registry`.
* manmanson/track_supported_set_recalculation_v7:
raft: re-advertise gossiper features when raft feature support changes
raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
gms: feature_service: update `system.local#supported_features` when feature support changes
test: cql_test_env: enable features in a `seastar::thread`
Move the listener from feature service to the `raft_group_registry`.
Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch adds the ability to pass an abort_source to Raft request APIs
(add_entry, modify_config) to make them abortable. A request issuer does
not always want to wait for a request to complete - for instance because
a client disconnected, or because it is no longer interested due to a
timeout. After this patch it can abort the wait for such requests
through an abort source. Note that aborting a request only aborts the
wait for it to complete; it does not mean that the request will not
eventually be executed.
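The key semantic - aborting cancels the wait, not the request - can be illustrated with plain C++ threads (this is a standalone sketch with invented names, not Seastar's `abort_source` API):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal stand-in for an abort source: firing it wakes up waiters.
struct abort_source_sketch {
    std::mutex m;
    std::condition_variable cv;
    bool aborted = false;
    void request_abort() {
        { std::lock_guard<std::mutex> g(m); aborted = true; }
        cv.notify_all();
    }
};

// Wait until the request completes (`done`) or the abort source fires.
// Returns true if the request completed, false if the wait was aborted.
// Note: aborting returns control to the caller, but the request itself
// may still complete in the background later.
bool wait_for_request(std::atomic<bool>& done, abort_source_sketch& as) {
    std::unique_lock<std::mutex> lk(as.m);
    as.cv.wait(lk, [&] { return done.load() || as.aborted; });
    return done.load();
}
```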
Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
We're currently stopping raft_gr before
shutting the database down, but we fail to do so if
anything goes wrong before that, e.g. if
distributed_loader::init_non_system_keyspaces fails.
This change splits drain_on_shutdown out of stop()
to stop the Raft groups before the database is stopped,
and does the rest in a deferred_stop placed right
after the raft_gr registry is started.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
stop_servers should never fail since it's called on
the shutdown path.
Use a local gate in stop_servers() to wait on all
background raft group server aborts.
Also, handle theoretical exceptions from server::abort()
to guarantee success.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).
The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
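The persist-before-respond rule can be sketched as a small wrapper. All names here are illustrative (`*_sketch`, string peers, in-memory "storage"), mirroring the structure of `persistent_discovery` rather than its real code:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Bare discovery state: a growing list of known peers.
struct discovery_sketch {
    std::vector<std::string> _peers;
    void learn(std::string p) { _peers.push_back(std::move(p)); }
    const std::vector<std::string>& peers() const { return _peers; }
};

// Wrapper enforcing the contract: persist peers() before any output
// (e.g. a response to a request) leaves the node, and restore the
// persisted peers when constructed again after a restart.
struct persistent_discovery_sketch {
    discovery_sketch _d;
    std::vector<std::string> _stored; // stands in for on-disk state

    explicit persistent_discovery_sketch(std::vector<std::string> stored)
            : _stored(std::move(stored)) {
        // Restore previously persisted peers (e.g. after a crash).
        for (const auto& p : _stored) {
            _d.learn(p);
        }
    }

    // Handle a discovery request: learn the sender, persist the updated
    // peer list, and only then return the response.
    std::vector<std::string> request(std::string from) {
        _d.learn(std::move(from));
        auto response = _d.peers();
        _stored = _d.peers(); // persist before the response is sent
        return response;
    }
};
```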
The name `get_output` suggests that this is the only way to get output
from `discovery`. But there is a second public API: `request`, which also
provides us with a different kind of output.
Rename it to `tick`, which describes what the API is used for:
periodically ticking the discovery state machine in order to make
progress.
In `raft_group0::discover_group0`, when we detect that we became a
leader, we destroy the `discovery` object, create a group 0, and respond
with the group 0 information to all further requests.
However, there is a small time window after becoming a leader but before
destroying the `discovery` object in which we still answer discovery
requests by returning peer lists, without informing the requester that
we became a leader.
This is unsafe, and the algorithm specification does not allow this. For
example, consider the seed graph 0 --> 1. It satisfies the property
required by the algorithm, i.e. that there exists a vertex reachable
from every other vertex. Now `1` can become a leader before `0` contacts it.
When `0` contacts `1`, it should learn from `1` that `1` created a group 0, so
`0` does not become a leader itself and create another group 0. However,
with the current implementation, it may happen that `0` contacts `1` and
receives a peer list (instead of group 0 information), and also becomes
a leader because it has the smallest ID, so we end up with two group 0s.
The correct thing to do is to stop returning peer lists to requests
immediately after becoming a leader. This is what we fix in this commit.
Except for the first node, which creates group 0, make other nodes join
as non-voters and make them voters after a successful bootstrap.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
There will be nodes in the non-voting state in the configuration, so
`can_vote()` is not a good check. Use the newer `cfg.contains()`.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration, and
ticking speed of servers.
This allowed us to catch a serious bug, which is fixed in the first patch.
The patchset also fixes bugs in the test itself and adds quality-of-life
improvements such as better diagnostics when an inconsistency is detected.
* kbr/nemesis-random-v1:
test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
test: raft: randomized_nemesis_test: print details when detecting inconsistency
test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
test: raft: randomized_nemesis_test: keep server id in impure_state_machine
test: raft: randomized_nemesis_test: frequent snapshotting configuration
test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
test: raft: randomized_nemesis_test: simplify ticker
test: raft: randomized_nemesis_test: randomize network delay
test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
test: raft: randomized_nemesis_test: fix obsolete comment
raft: fsm: print configuration entries appearing in the log
raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
raft: server: abort snapshot applications before waiting for rpc abort
raft: server: logging fix
raft: fsm: don't advance commit index beyond matched entries