Commit Graph

306 Commits

Author SHA1 Message Date
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Gleb Natapov
3ff6f76cef raft: test: add read_barrier test to replication_test 2021-08-25 08:57:13 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test 2021-08-25 08:57:13 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.

The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
2021-08-25 08:57:13 +03:00
Alejo Sanchez
a5c74a6442 raft: candidate timeout proportional to cluster size
To avoid dueling candidates with large clusters, make the timeout
proportional to the cluster size.

Debug mode is too slow for a test of 1000 nodes so it's disabled, but
the test passes for release and dev modes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
7206eae16e raft: testing: many nodes test
Tests with many nodes and realistic timers and ticks.

Network delays are kept as a fraction of ticks. (e.g. 20/100)

Tests with 600 or more nodes hang in debug mode.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
87a03a3485 raft: replication test: remove unused tick_all
Tests now wait for normal ticks for election, remove deprecated tick_all
helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:09:01 +02:00
Alejo Sanchez
14c214d73e raft: replication test: delays
Allow test supplied delays for rpc communication.

Allow supplying network delay, local delay (nodes within the same
server), how many nodes are local, and an extra small delay simulating
local load.

Modify rpc class to support delays. If delays are enabled, it no longer
directly calls the other node's server code but it schedules it to be
called later. This makes the test more realistic as in the previous
version the first candidate was always going to get to all followers
first, preventing a dueling candidates scenario.

Previously, tickers were all scheduled at the same time, so there was no
spread of them across the tick time. Now these tickers are scheduled
with a uniform spread across this time (tick delta).

Also previously, for custom free elections used tick_all() which
traversed _in_configuration sequentially and ticked each. This, combined
with rpc outbound directly calling methods in the other server without
yielding, caused free elections to be unrealistic with same order
determined and first candidate always winning. This patch changes this
behavior. The free election uses normal tickers (now uniformly
distributed in tick delay time) and its loop waits for tick delay time
(yielding) and checks if there's a new leader. Also note the order might
not be the same in debug mode if more than one tick is scheduled.

As rpc messages are sent delayed, network connectivity needs to be
checked again before calling the function on the remote side.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-24 13:05:53 +02:00
Alejo Sanchez
db23823c77 raft: replication test: packet drop rpc helper
Add a helper to check if a packet should be dropped.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
497af3167f raft: replication test: connectivity configuration
Pass packet drops within connectivity configuration struct.
Default to no packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4d5428e8a raft: replication test: rpc network map in raft_cluster
Move rpc network map to raft cluster, no longer as static in rpc class.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
192ac5be4c raft: replication test: use minimum granularity
seastar lowres_clock minimum granularity is 10ms, not 1ms.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
5cfe6c1ca2 raft: replication test: minor: rename local to int ids
For clarity, name 0-based integer ids as int ids not local.
This is in contrast with 1-based UUID ids.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
27d90f0165 raft: replication test: fix restart_tickers when partitioning
When partitioning, elect_new_leader restarts tickers, so don't
re-restart them in this case.

When leader is dropped and no new leader is specified, restart tickers
before free election.

If no change of leader, restart tickers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
e4262291f2 raft: replication test: partition ranges
Allow specifying ranges within partition to handle large number of
nodes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
56a110d42f raft: replication test: isolate one server
Support disconnection of one server with the rest.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6b3327c753 raft: replication test: move objects out of header
Use a separate cc file for definitions and objects.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cea18e6830 raft: replication test: make dummy command const
Make dummy command const in header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
2db3192ac3 raft: replication test: template clock type
Templetize clock type.

Use a struct for run_test to work around
https://bugs.llvm.org/show_bug.cgi?id=50345

With help from @kbr-

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
cb35588fb1 raft: replication test: tick delta inside raft_cluster
Store tick delta inside raft_cluster.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
49cb040037 raft: replication test: style - member initializer
Fix raft_cluster constructor member initializer list.
2021-08-23 17:50:16 +02:00
Alejo Sanchez
6e2ab657b3 raft: replication test: move common code out
Common replication test code moved to header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Alejo Sanchez
a6cd35c512 raft: testing: refactor helper
Move definitions to helper object file.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-08-23 17:50:16 +02:00
Kamil Braun
3344ac8a6c test: raft: randomized_nemesis_test: a basic generator test
The previous commits introduced basic the generator concept and a
library of most common composition patterns.

In this commit we implement a test that uses this new infrastructure.

Two `Executable` operations are implemented:
- `raft_call` is for calling to a Raft cluster with a given state
  machine command,
- `network_majority_grudge` partitions the network in half,
  putting the leader in the minority.

We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).

For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
2021-08-16 13:07:08 +02:00
Kamil Braun
66ec484730 test: raft: generator: a library of basic generators
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.

This commit introduces a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
  operation type - `either_of` - which is a "union" of the original
  operation types. Executing `either_of` operation means executing an
  operation of one of the original types, but the specific type
  can be chosen in runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
  limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
  new generator which spreads the operations from `g` roughly every `d`
  ticks. (The actual definition in code is more general and complex but
  the idea is similar.)

And so on.

Some of these patterns have correspodning notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
2021-08-16 13:07:08 +02:00
Kamil Braun
d8863c5a7b test: raft: introduce generators
We introduce the concepts of "operations" and "generators", basic
building blocks that will allow us to declaratively write randomized
tests for torturing simulated Raft clusters.

An "operation" is a data structure representing a computation which
may cause side effects such as calling a Raft cluster or partitioning
the network, represented in the code with the `Executable` concept.
It has an `execute` function performing the computation and returns
a result of type `result_type`. Different computations of the same type
share state of type `state_type`. The state can, for example, contain
database handles.

Each execution is performed on an abstract `thread' (represented by a `thread_id`)
and has a logical starting time point. The thread and start point together form
the execution's `context` which is passed as a reference to `execute`.

Two operations may be called in parallel only if they are on different threads.

A generator, represented through the `Generator` concept, produces a
sequence of operations. An operation can be fetched from a generator
using the `op` function, which also returns the next state of the
generator (generators are purely functional data structures).

The generator concept is inspired by the generators in the Jepsen
testing library for distributed systems.

We also implement `interpreter` which "interprets", or "runs", a given
generator, by fetching operations from the generator and executing them
with concurrency controlled by the abstract threads.

The algorithm used in the interpreter is also similar to the interpreter
algorithm in Jepsen, although there are differences. Most notably we don't
have a "worker" concept - everything runs on a single shard; but we use
"abstract threads" combined with futures for concurrency.
There is also no notion of "process". Finally, the interpreter doesn't
keep an explicit history, but instead uses a callback `Recorder` to notify
the user about operation invocations and completions. The user can
decide to save these events in a history, or perhaps they can analyze
them on the fly using constant memory.
2021-08-16 13:07:08 +02:00
Kamil Braun
421b1b9494 test: raft: introduce future_set
A set of futures that can be polled.

Polling the set (`poll` function) returns the value of one of
the futures which became available or `std::nullopt` if the given
logical durationd passes (according to the given timer), whichever
event happens first.  The current implementation assumes sequential
polling.

New futures can be added to the set with `add`.
All futures can be removed from the set with `release`.
2021-08-16 13:07:08 +02:00
Kamil Braun
a5e92e1c45 test: raft: randomized_nemesis_test: handle raft::stopped_error in timeout futures
The timeout futures in `call` and `reconfigure` may be discarded after
Raft servers were `abort()`ed which would result in
`raft::stopped_error` and the test complained about discarded
exceptional futures. Discard these errors explicitly.
2021-08-16 13:07:08 +02:00
Kamil Braun
7533c84e62 raft: sometimes become a candidate even if outside the configuration
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
2021-08-06 13:18:32 +02:00
Kamil Braun
93822b0ee7 test: raft: regression test for storing cluster configuration when taking snapshots
Before the fix introduced in the previous patch, the cluster would
forget its configuration when taking a snapshot, making it unable to
reelect a leader. This regression test catches that.
2021-08-06 12:17:22 +02:00
Kamil Braun
f050d3682c raft: fsm: stronger check for outdated remote snapshots
We must not apply remote snapshots with commit indexes smaller than our
local commit index; this could result in out-of-order command
application to the local state machine replica, leading to
serializability violations.
Message-Id: <20210805112736.35059-1-kbraun@scylladb.com>
2021-08-05 14:29:50 +02:00
Kamil Braun
4165045356 test: raft: randomized_nemesis_test: handle timeouts in rpc::send_snapshot
They were already correctly returned to the caller, but we had a
leftover discarded future that would sometimes end up with a
broken_promise exception. Ignore the exception explicitly.
Message-Id: <20210803122207.78406-1-kbraun@scylladb.com>
2021-08-04 15:24:47 +03:00
Tomasz Grabiec
3e47f28c65 Merge "raft: use the correct term when storing a snapshot" from Kamil
We should not use the current term; we should use the term of the
snapshot's index, which may be lower.

* https://github.com/kbr-/scylla/tree/snapshot-right-term-fix:
  test: raft: regression test for using the correct term when taking a snapshot
  test: raft: randomized_nemesis_test: server configuration parameter
  raft: use the correct term when storing a snapshot
2021-08-02 15:33:52 +02:00
Kamil Braun
ac5121a016 test: raft: regression test for using the correct term when taking a snapshot 2021-08-02 11:48:35 +02:00
Kamil Braun
63fdc718d4 test: raft: randomized_nemesis_test: server configuration parameter 2021-08-02 11:47:19 +02:00
Gleb Natapov
4764028cb3 raft: Remove leader_id from append_request
The filed is not used anywhere.

Message-Id: <YP0khmjK2JSp77AG@scylladb.com>
2021-07-28 20:30:07 +02:00
Kamil Braun
b5a7220da4 test: raft: randomized_nemesis_test: reconfigure function
Instead of calling `set_configuration` directly on a `raft::server`, the
caller will use the higher-level `reconfigure`. Similarly to `call`, the
function converts exceptions into return values (inside a `variant`) and
allows passing in a timeout parameter.
2021-07-13 11:15:26 +02:00
Kamil Braun
eb4a8d48aa test: raft: randomized_nemesis_test: refactor waiting for leader into a separate function 2021-07-13 11:15:26 +02:00
Kamil Braun
69c59ec801 test: raft: randomized_nemesis_test: persistence: avoid creating gaps in the log when storing snapshots
When storing a snapshot `snap`, if `snap.idx > e.idx` where `e`
is the last entry in the log (if any), we need to clear all previous
entries so that we don't create a gap in the log. The log must remain
contiguous.

One case is controversial: what to do if `snap.idx == e.idx + 1`.
Technically no gap would be created between the entry and the snapshot.
However, if we now want to store a new entry with index `e.idx + 2`,
that would create a gap between two entries which is illegal.
2021-07-13 11:15:26 +02:00
Kamil Braun
f381a97f6f test: raft: randomized_nemesis_test: persistence: handle complex state types
The usage of `template <..., State init_state>` in `persistence`
permitted using only a very restricted class of types (so called
"structural types").

Pass the initial state through `persistence`'s constructor instead. Also
modify the member functions so the State type doesn't need to have a
default constructor.
2021-07-13 11:15:25 +02:00
Kamil Braun
59e04b2b2e test: raft: randomized_nemesis_test: call: handle raft::dropped_entry
This exception happens when the leader stops being a leader in the
middle of a call. Expect it to happen and return it in the result
variant.
2021-07-13 11:15:25 +02:00
Kamil Braun
d97cf1a254 test: raft: randomized_nemesis_test: impure_state_machine/call: handle dropped channels
Inside `call`, if `add_entry` failed or the operation timed out,
the output channel promise would be dropped without setting a value, causing
a `broken_promise` exception. Furthermore the output future would be
dropped, so we get a discarded `broken_promise` future.

The fix:
1. When we drop a channel without a result (inside
   `impure_state_machine::with_output_channel`), set an explicit
    exception with a dedicated type.
2. Discard the channel future in a controlled way, explicitly handling
   the `output_channel_dropped` exception.
2021-07-13 11:15:25 +02:00
Kamil Braun
f51ff786bd test: raft: randomized_nemesis_test: environment: expose the network
Let the user of `environment` access the `network` directly
for e. g. introducing network partitions.
2021-07-13 11:15:25 +02:00
Kamil Braun
26d2f99cad test: raft: randomized_nemesis_test: configurable network delay and FD convict threshold
The following are now passed to `environement` as parameters:
- network delay,
- failure detector convict threshold.
Environment passes them further down when constructing the underlying
objects.
2021-07-13 11:15:25 +02:00
Kamil Braun
035ae2eb1b test: raft: randomized_nemesis_test: generalize with_env_and_ticker
Generalize the type of the callback: use a template parameter instead of
`noncopyable_function` and don't assume the return type of the callback.

This allows returning a result from `with_env_and_ticker`, e.g. for
performing analysis or logging the results after a part of the test that
used the environment and ticker have finished.
2021-07-13 11:15:25 +02:00
Kamil Braun
25fb195bc7 test: raft: randomized_nemesis_test: network: add_grudge, remove_grudge functions
Extend the interface of `network` to allow introducing and removing "grudges"
which prevent the delivery of messages from one given server to another
(when the time comes to deliver a message but there's a grudge, the
message is dropped).
2021-07-13 11:15:25 +02:00
Kamil Braun
774ef653b1 test: raft: randomized_nemesis_test: move ticker to its own header 2021-07-13 11:15:25 +02:00
Kamil Braun
a45e8e0db0 test: raft: randomized_nemesis_test: ticker: take logger as a constructor parameter
Remove the global dependency on `tlogger`.
2021-07-13 11:15:25 +02:00
Kamil Braun
21b5a6d9f7 test: raft: logical_timer: handle immediate timeout
If the user calls `with_timeout` with a time point that's already been
reached, we return `timed_out_error` immediately.
2021-07-13 11:15:25 +02:00
Kamil Braun
ed8e9a564a test: raft: logical_timer: on timeout, return the original future in the exception
More specifically, return a future which is equivalent to the original
future (when the original future resolves, this future will contain its
result).

Thus we don't discard the future, the user gets it back.
Let them decide what to do with it.
2021-07-13 11:15:25 +02:00