Commit Graph

90 Commits

Author SHA1 Message Date
Eliran Sinvani
e0c7178e75 query_processor: remove default internal query caching behavior
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2022-05-01 08:33:55 +03:00
Kamil Braun
b1b22f2c2b service: raft: don't support/advertise USES_RAFT feature
The code would advertise the USES_RAFT feature when the SUPPORTS_RAFT
feature was enabled through a listener registered on the SUPPORTS_RAFT
feature.

This would cause a deadlock:
1. `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`
   locks the gossiper (it's called for the first time from sstables
   format selector).
2. The function calls `on_change` listeners.
3. One of the listeners is the one for SUPPORTS_RAFT.
4. The listener calls
   `gossiper::add_local_application_state(SUPPORTED_FEATURES, ...)`.
5. This tries to lock the gossiper.

In turn, depending on timing, this could hang the startup procedure,
which calls `add_local_application_state` multiple times at various
points, trying to take the lock inside gossiper.

This prevents us from testing raft / group 0, new schema change
procedures that use group 0, etc.

For now, simply remove the code that advertises the USES_RAFT feature.
Right now the feature has no other effect on the system than just
becoming enabled. In fact, it's possible that we don't need this second
feature at all (SUPPORTS_RAFT may be enough), but that's
work-in-progress. If needed, it will be easy to bring the enabling code
back (in a fixed form that doesn't cause a deadlock). We don't remove
the feature definitions yet just in case.

Refs: #10355
2022-04-15 16:08:25 +02:00
Pavel Solodovnikov
293c5f39ee service: raft_group0: make join_group0 re-entrant
Detect if we have already finished joining group0 before
and do nothing in that case.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:36:40 +03:00
Pavel Solodovnikov
0d5e2157e1 raft_group_registry: update gossiper state only on shard 0
Since `gossiper::add_local_application_state` is not
safe to call concurrently from multiple shards (which
will cause a deadlock inside the method), call this
only on shard 0 in `_raft_support_listener`.

This fixes sporadic hangs when starting a fresh node in an
empty cluster where node hangs during startup.

Tests: unit(dev), manual

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:33:40 +03:00
Pavel Solodovnikov
7903d2afa8 raft: don't update gossiper state if raft is enabled early or not enabled at all
There is a listener in the `raft_group_registry`,
which makes the gossiper to re-publish supported
features app state to the cluster.

We don't need to do this in case `USES_RAFT_CLUSTER_MANAGEMENT`
feature is enabled before the usual time, i.e. before the
gossiper settles. So, short-circuit the listener logic in
that case and do nothing.

Also, don't do anything if raft group registry is not enabled
at all, this is just a generic safeguard.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:31:29 +03:00
Tomasz Grabiec
cd5fec8a23 Merge "raft: re-advertise gossiper features when raft feature support changes" from Pavel
Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
raft feature.

This small series consists of 3 parts to fix the handling of supported
features for raft:
1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
   `raft_group_registry`.
2. Update `system.local#supported_features` directly in the
   `feature_service::support()` method.
3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
   value in the support callback within `raft_group_registry`.

* manmanson/track_supported_set_recalculation_v7:
  raft: re-advertise gossiper features when raft feature support changes
  raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
  gms: feature_service: update `system.local#supported_features` when feature support changes
  test: cql_test_env: enable features in a `seastar::thread`
2022-03-18 12:34:17 +01:00
Pavel Solodovnikov
ebc2178ea5 raft: re-advertise gossiper features when raft feature support changes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:29 +03:00
Pavel Solodovnikov
011942dcce raft: move tracking SUPPORTS_RAFT_CLUSTER_MANAGEMENT feature to raft
Move the listener from feature service to the `raft_group_registry`.

Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:25 +03:00
Gleb Natapov
a1604aa388 raft: make raft requests abortable
This patch adds an ability to pass abort_source to raft request APIs (
add_entry, modify_config) to make them abortable. A request issuer not
always want to wait for a request to complete. For instance because a
client disconnected or because it no longer interested in waiting
because of a timeout. After this patch it can now abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete, it does not mean that the request
will not be eventually executed.

Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
2022-03-16 18:38:01 +01:00
Benny Halevy
8481852c91 cql_test_env: raft_group_registry::drain_on_shutdown before stopping the
database

We're currently stopping raft_gr before
shutting the database down, but we fail to do that if
anything goes wrong before that, e.g. if
distributed_loader::init_non_system_keyspaces fails.

This change splits drain_on_shutdown out of stop()
to stop the raft groups before the database is stopped
and does the rest in a deferred_stop placed right
after the rafr_gr registry is strated.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Benny Halevy
ac307d6a62 raft_group_registry: harden stop_servers
stop_servers should never fail since it's called on
the shutdown path.

Use a local gate in stop_servers() to wait on all
background raft group server aborts.

Also, handle theoretical exceptions from server::abort()
to guarantee success.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Benny Halevy
ab30feb71d raft_group_registry: delete unused _shutdown_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-03-14 11:49:44 +02:00
Kamil Braun
4147ef94b0 service: raft: raft_group0: persist discovered peers and restore on restart
We add a `peers()` method to `discovery` which returns the peers
discovered until now (including seeds). The caller of functions which
return an output -- `tick` or `request` -- is responsible for persisting
`peers()` before returning the output of `tick`/`request` (e.g. before
sending the response produced by `request` back). The user of
`discovery` is also responsible for restoring previously persisted peers
when constructing `discovery` again after a restart (e.g. if we
previously crashed in the middle of the algorithm).

The `persistent_discovery` class is a wrapper around `discovery` which
does exactly that.
2022-02-14 12:05:18 +01:00
Kamil Braun
02d4087c6e service: raft: discovery: rename get_output to tick
The name `get_output` suggests that this is the only way to get output
from `discovery`. But there is a second public API: `request`, which also
provides us with a different kind of output.

Rename it to `tick`, which describes what the API is used for:
periodically ticking the discovery state machine in order to make
progress.
2022-02-14 12:04:37 +01:00
Kamil Braun
586ef8fc23 service: raft: discovery: stop returning peer_list from request after becoming leader
In `raft_group0::discover_group0`, when we detect that we became a
leader, we destroy the `discovery` object, create a group 0, and respond
with the group 0 information to all further requests.

However there is a small time window after becoming a leader but before
destroying the `discovery` object where we still answer to discovery
requests by returning peer lists, without informing the requester that
we become a leader.

This is unsafe, and the algorithm specification does not allow this. For
example, consider the seed graph 0 --> 1. It satisfies the property
required by the algorithm, i.e. that there exists a vertex reachable
from every other vertex. Now `1` can become a leader before `0` contacts it.
When `0` contacts `1`, it should learn from `1` that `1` created a group 0, so
`0` does not become a leader itself and create another group 0. However,
with the current implementation, it may happen that `0` contacts `1` and
receives a peer list (instead of group 0 information), and also becomes
a leader because it has the smallest ID, so we end up with two group 0s.

The correct thing to do is to stop returning peer lists to requests
immediately after becoming a leader. This is what we fix in this commit.
2022-02-14 12:04:37 +01:00
Alejo Sanchez
a0c2bc0df2 raft: nodes joining as non-voters
Except for the first node creating the group0, make other nodes join as
non-voters and make them voters after successful bootstrap.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 09:16:30 -04:00
Alejo Sanchez
2d9f40f716 raft: group 0: use cfg.contains() for config check
There will be nodes in non-voting state in configuration, so can_vote()
is not a good check. Use newer cfg.contains().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2022-02-08 08:00:07 -04:00
Tomasz Grabiec
b78bab7286 Merge "raft: fixes and improvements to the library and nemesis test" from Kamil
Raft randomized nemesis test was improved by adding some more
chaos: randomizing the network delay, server configuration,
ticking speed of servers.

This allowed to catch a serious bug, which is fixed in the first patch.

The patchset also fixes bugs in the test itself and adds quality of life
improvements such as better diagnostics when inconsistency is detected.

* kbr/nemesis-random-v1:
  test: raft: randomized_nemesis_test: print state of each state machine when detecting inconsistency
  test: raft: randomized_nemesis_test: print details when detecting inconsistency
  test: raft: randomized_nemesis_test: print snapshot details when taking/loading snapshots in `impure_state_machine`
  test: raft: randomized_nemesis_test: keep server id in impure_state_machine
  test: raft: randomized_nemesis_test: frequent snapshotting configuration
  test: raft: randomized_nemesis_test: tick servers at different speeds in generator test
  test: raft: randomized_nemesis_test: simplify ticker
  test: raft: randomized_nemesis_test: randomize network delay
  test: raft: randomized_nemesis_test: fix use-after-free in `environment::crash()`
  test: raft: randomized_nemesis_test: fix use-after-free in two-way rpc functions
  test: raft: randomized_nemesis_test: rpc: don't propagate `gate_closed_exception` outside
  test: raft: randomized_nemesis_test: fix obsolete comment
  raft: fsm: print configuration entries appearing in the log
  raft: `operator<<(ostream&, ...)` implementation for `server_address` and `configuration`
  raft: server: abort snapshot applications before waiting for rpc abort
  raft: server: logging fix
  raft: fsm: don't advance commit index beyond matched entries
2022-01-31 13:25:27 +01:00
Kamil Braun
46f6a0cca5 raft: server: abort snapshot applications before waiting for rpc abort
The implementation of `rpc` may wait for all snapshot applications to
finish before it can finish aborting. This is what the
randomized_nemesis_test implementation did. This caused rpc abort to
hang in some scenarios.

In this commit, the order of abort calls is modified a bit. Instead of
waiting for rpc abort to finish and then aborting existing snapshot
applications, we call `rpc::abort()` and keep the future, then abort
snapshot applications, then wait on the future. Calling `rpc::abort()`
first is supposed to prevent new snapshot applications from starting;
a comment was added at the interface definition. The nemesis test
implementation had this property, and `raft_rpc` in group registry
was adjusted appropriately. Aborting the snapshot applications then
allows `rpc::abort()` to finish.
2022-01-26 16:06:45 +01:00
Gleb Natapov
579dcf187a raft: allow an option to persist commit index
Raft does not need to persist the commit index since a restarted node will
either learn it from an append message from a leader or (if entire cluster
is restarted and hence there is no leader) new leader will figure it out
after contacting a quorum. But some users may want to be able to bring
their local state machine to a state as up-to-date as it was before restart
as soon as possible without any external communication.

For them this patch introduces new persistence API that allows saving
and restoring last seen committed index.

Message-Id: <YfFD53oS2j1My0p/@scylladb.com>
2022-01-26 14:06:39 +01:00
Kamil Braun
6a00e790c7 service: raft: check and update state IDs during group 0 operations
The group 0 state machine will only modify state during command
application if the provided "previous state ID" is equal to the
last state ID present in the history table. Otherwise, the command will
be a no-op.

To ensure linearizability of group 0 changes, the performer of the
change must first read the last state ID, only then read the state
and send a command for the state machine. If a concurrent change
races with this command and manages to modify the state, we will detect
that the last state ID does not match during `apply`; all calls to
`apply` are serialized, and `apply` adds the new entry to the history
table at the end, after modifying the group 0 state.

The details of this mechanism are abstracted away with `group0_guard`.
To perform a group 0 change, one needs to call `announce`, which
requires a `group0_guard` to be passed in. The only way to obtain a
`group0_guard` is by calling `start_group0_operation`, which underneath
performs a read barrier on group 0, obtains the last state ID from the
history table, and constructs a new state ID that the change will append
to the history table. The read barrier ensures that all previously
completed changes are visible to this operation. The caller can then
perform any necessary validation, construct mutations which modify group
0 state, and finally call `announce`.

The guard also provides a timestamp which is used by the caller
to construct the mutations. The timestamp is obtained from the new state ID.
We ensure that it is greater than the timestamp of the last state ID.
Thus, if the change is successful, the applied mutations will have greater
timestamps than the previously applied mutations.

We also add two locks. The more important one, used to ensure
correctness, is `read_apply_mutex`. It is held when modifying group 0
state (in `apply` and `transfer_snapshot`) and when reading it (it's
taken when obtaining a `group0_guard` and released before a command is
sent in `announce`). Its goal is to ensure that we don't read partial
state, which could happen without it because group 0 state consist of
many parts and `apply` (or `transfer_snapshot`) potentially modifies all
of them. Note: this doesn't give us 100% protection; if we crash in the
middle of `apply` (or `transfer_snapshot`), then after restart we may
read partial state. To remove this possibility we need to ensure that
commands which were being applied before restart but not finished are
re-applied after restart, before anyone can read the state. I left a
TODO in `apply`.

The second lock, `operation_mutex`, is used to improve liveness. It is
taken when obtaining a `group0_guard` and released after a command is
applied (compare to `read_apply_mutex` which is released before a
command is sent). It is not taken inside `apply` or `transfer_snapshot`.
This lock ensures that multiple fibers running on the same node do not
attempt to modify group0 concurrently - this would cause some of them to
fail (due to the concurrent modification protection described above).
This is mostly important during first boot of the first node, when
services start for the first time and try to create their internal
tables. This lock serializes these attempts, ensuring that all of them
succeed.
2022-01-24 15:20:37 +01:00
Kamil Braun
509ac2130f service: raft: group0_state_machine: introduce group0_command
Objects of this type will be serialized and sent as commands to the
group 0 state machine. They contain a set of mutations which modify
group 0 tables (at this point: schema tables and group 0 history table),
the 'previous state ID' which is the last state ID present in the
history table when the operation described by this command has started,
and the 'new state ID' which will be appended to the history table if
this change is successful (successful = the previous state ID is still
equal to the last state ID in the history table at the moment of
application). It also contains the address of the node which constructed
this command.

The state ID mechanism will be described in more detail in a later
commit.
2022-01-24 15:20:37 +01:00
Kamil Braun
f908da919c service: raft: pass storage_proxy& to group0_state_machine
We'll use it to update the group 0 history table.
2022-01-24 15:12:50 +01:00
Kamil Braun
dce8ece4b6 service: raft: raft_state_machine: pass snapshot_descriptor to transfer_snapshot
Currently it takes just the snapshot ID. Extend it by taking the whole
snapshot descriptor.

In following commits I use this to perform additional logging.
2022-01-24 15:12:50 +01:00
Kamil Braun
538cc6ecb9 service: raft: rename schema_raft_state_machine to group0_state_machine
Generalize the name so it doesn't suggest that group 0 contains only
schema state.
2022-01-24 15:12:50 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Gleb Natapov
c500a90902 raft service: make one way raft messages truly one way
Raft core does not expect replies for most messages it sends, but they
are defined as two way by the IDL currently. Fix them to be one way.
2022-01-13 13:14:46 +02:00
Gleb Natapov
b1fea20d36 raft: move raft verbs to the IDL 2022-01-13 13:14:46 +02:00
Gleb Natapov
8a25b740df raft: split idl to rpc and storage
Storage uses only small part of the IDL, so it can include only the part
that is relevant to it.
2022-01-13 13:14:46 +02:00
Tomasz Grabiec
5eaca85e4b Merge "wire up schema raft state machine" from Gleb
This series wires up the schema state machine to process raft commands
and transfer snapshots. The series assumes that raft group zero is used
for schema transfer only and that single raft command contains single
schema change in a form of canonical_mutation array. Both assumptions
may change in which case the code will be changed accordingly, but we
need to start somewhere.

* scylla-dev/gleb/schema-raft-sm-v2:
  schema raft sm: request schema sync on schema_state_machine snapshot transfer
  raft service: delegate snapshot transfer to a state machine implementation
  schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply()
2021-12-08 13:14:28 +01:00
Kamil Braun
485c0b1819 raft: server: don't register metrics in start()
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).

Metrics are global so `register_metrics()` cannot be called on
two servers that have the same ID, which is useful e.g. in tests when we
want to simulate server stops and restarts.
2021-12-07 11:23:33 +01:00
Gleb Natapov
cab1a1c403 schema raft sm: request schema sync on schema_state_machine snapshot transfer
If the schema state machine requests snapshot transfer it means that
it missed some schema mutations and needs a full sync. We already have
a function that does it: migration_manager::submit_migration_task(),
so call it on a snapshot transfer.
2021-12-02 14:55:29 +02:00
Gleb Natapov
431f931d2a raft service: delegate snapshot transfer to a state machine implementation
We want raft service to support all kinds of state machines and most
services provided by it may indeed be shared. But snapshot transfer is
very state machine specific and thus cannot be put into the raft service.
This patch delegates snapshot transfer implementation to a state machine
implementation.
2021-12-02 10:54:44 +02:00
Gleb Natapov
fd109ecff1 schema raft sm: pass migration manager to schema_raft_state_machine and merge schema on apply()
This patch wires up schema_raft_state_machine::apply() function. For now
it assumes that a raft command contains single schema change in the form
of a schema mutation array. It may change later (we may add more info to
a schema), but for now this will do.
2021-12-02 10:46:32 +02:00
Gleb Natapov
f2ab5f4e60 raft service: insert a new raft instance into the servers' list only after it is started
RPC module starts to dispatching calls to a server the moment it is in the
servers' list, but until raft::server::start() completes the instance is
not fully created yet and is not ready to accept anything. Fix the code
that initialize new raft group to insert new raft instance into the list
only after it is started.

Message-Id: <YZTFFW9v0NlV7spR@scylladb.com>
2021-12-01 13:11:49 +01:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully chose the timing when we
join and leave Raft group 0:

Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
Konstantin Osipov
96e2594207 raft: (service) break a dependency loop
Break a dependency loop raft_rpc <-> raft_group_registry
via raft_address_map. Pass raft_address_map to raft_rpc and
raft_gossip_failure_detector explicitly, not entire raft_group_registry.

Extract server_for_group into a helper class. It's going to be used by
raft_group0 so make it easier to reference.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
8ee88a9d8a raft: (discovery) introduce leader discovery state machine
Introduce a special state machine used to to find
a leader of an existing Raft cluster or create
a new cluster.

This state machine should be used when a new
Scylla node has no persisted Raft Group 0 configuration.

The algorithm is initialized with a list of seed
IP addresses, IP address of this server, and,
this server's Raft server id.

The IP addresses are used to construct an initial list of peers.

Then, the algorithm tries to contact each peer (excluding self) from
its peer list and share the peer list with this peer, as well as
get the peer's peer list. If this peer is already part of
some Raft cluster, this information is also shared. On a response
from a peer, the current peer's peer list is updated. The
algorithm stops when all peers have exchanged peer information or
one of the peers responds with id of a Raft group and Raft
server address of the group leader.

(If any of the peers fails to respond, the algorithm re-tries
ad infinitum with a timeout).

More formally, the algorithm stops when one of the following is true:
- it finds an instance with initialized Raft Group 0, with a leader
- all the peers have been contacted, and this server's
  Raft server id is the smallest among all contacted peers.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
e3751068fe raft: (server) allow adding entries/modify config on a follower
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.

The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.

When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.

This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.

Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.

Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.

Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
2021-11-25 11:50:38 +03:00
Avi Kivity
369afe3124 treewide: use coroutine::maybe_yield() instead of co_await make_ready_future()
The dedicated API shows the intent, and may be a tiny bit faster.

Closes #9382
2021-09-23 12:28:56 +02:00
Tomasz Grabiec
83113d8661 Merge "raft: new schema for storing raft snapshots" from Pavel Solodovnikov
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
   documentation in the future.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Rename `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` stucture contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT` or `PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

* manmanson/raft_snapshots_new_schema_v2:
  test: adjust `schema_change_test` to include new `system.raft_config` table
  raft: new schema for storing raft snapshots
  raft: pass server id to `raft_sys_table_storage` instance
2021-09-10 20:41:59 +02:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it trough its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Pavel Solodovnikov
8d3c0ee9b6 raft: new schema for storing raft snapshots
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
   documentation in the future.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Rename `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` stucture contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT` or `PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:46 +03:00
Pavel Solodovnikov
0a8faee660 raft: pass server id to raft_sys_table_storage instance
Preparations for changing raft snapshots schema.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:20 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.

The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
2021-08-25 08:57:13 +03:00
Benny Halevy
4439e5c132 everywhere: cleanup defer.hh includes
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:39 +03:00
Pavel Solodovnikov
f98cb96506 raft: raft_sys_table_storage_test: don't use initializer lists inside loops and coroutines
Workaround for Clang bug: https://bugs.llvm.org/show_bug.cgi?id=51515

When compiled on aarch64 with ASAN support and -Og/-Oz/-Os optimization
level, `raft_sys_table_storage::do_store_log_entries` crashes during the
tests. ASAN incorrectly reports `stack-use-after-return` on
`std::vector` list initialization after initial coroutine suspension
(initializer list's data pointer starts to point to garbage).

The workaround is simple: don't use initializer lists in such case
and replace with a series of `emplace_back` calls.

Tests: unit(debug, aarch64)

Fixes #9178

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210818102038.92509-1-pa.solodovnikov@scylladb.com>
2021-08-18 13:32:55 +03:00
Pavel Solodovnikov
bcbcc18aa1 raft: raft_sys_table_storage: fix broken load_snapshot and load_term_and_vote
Loading snapshot id and term + vote involve selecting static
fields from the "system.raft" table, constrained by a given
group id.

The code incorrectly assumes that, for example,
`SELECT snapshot_id FROM raft WHERE group_id=?` in
`load_snapshot` always returns  only one row.
This is not true, since this will return a row
for each (pk, ck) combination, which is (group_id, index)
for "system.raft" table.

The same applies for the `load_term_and_vote`, which selects
static `vote_term` and `vote` from "system.raft".

This results in a crash at node startup when there is
a non-empty raft log containing more than one entry
for a given `group_id`.

Restrict the selection to always return one row by applying
`LIMIT 1` clause.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210723183232.742083-1-pa.solodovnikov@scylladb.com>
2021-07-25 02:01:34 +02:00
Konstantin Osipov
bd410da77a raft: (service) rename raft_services service to raft_group_registry
This is a more informative name. Helps see that, say, group0
is a separate service and not bundle all raft services together.
Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00
Konstantin Osipov
025f18325e raft: (service) move raft service to namespace service
Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>
2021-06-21 14:53:54 +03:00