26515 Commits

Author SHA1 Message Date
Pavel Emelyanov
fdfcda97d7 allocation_strategy: Mark size_for_allocation_strategy noexcept
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Botond Dénes
dbb6851d4d test/manual/sstable_scan_footprint: don't double close the semaphore
The semaphore `stats_collector` references is the one obtained from the
database object, which is already stopped by `database::stop()`, making
the stop in `~stats_collector()` redundant, and even worse, as it
triggers an assert failure. Remove it.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518140913.276368-1-bdenes@scylladb.com>
2021-05-18 17:55:52 +03:00
Avi Kivity
16ff92745f Merge 'perf: add alternator frontend to perf_simple_query' from Piotr Sarna
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing `--write` and `--delete` parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

Example output showing the difference in isolation levels:

```bash
$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
```
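
The summary lines in the output above can be checked by hand. The sketch below recomputes the median and the median absolute deviation from the five tps samples of a run. The helper names `median` and `median_absolute_deviation` are illustrative and are not taken from perf_simple_query itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Median of a sample (hypothetical helper). Takes the vector by value
// so that it can sort its own copy.
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const auto n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Median absolute deviation: the median of |x - median(xs)|.
inline double median_absolute_deviation(const std::vector<double>& xs) {
    const double m = median(xs);
    std::vector<double> dev;
    dev.reserve(xs.size());
    for (double x : xs) {
        dev.push_back(std::fabs(x - m));
    }
    return median(std::move(dev));
}
```

For the `only_rmw_uses_lwt` run, the absolute deviations from the median 11081.24 are 207.48, 14.85, 18.85, 12.26 and 0, whose median is 14.85, consistent with the reported value.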

Closes #8656

* github.com:scylladb/scylla:
  perf: add alternator frontend to perf_simple_query
  cdc: make metadata.hh self-sufficient
  test: add minimal alternator_test_env
2021-05-18 16:17:54 +03:00
Piotr Sarna
6c6ccda8a0 perf: add alternator frontend to perf_simple_query
The perf_simple_query tool is extended with another protocol
aside from CQL - alternator. The alternative (pun intended) benchmark
can be executed by using the `--alternator X` parameter, where X
specifies one of the alternator's mandatory write isolation options:
 - "forbid_rmw" - forbids RMW (read-modify-write) requests
 - "unsafe" - never uses LWT (lightweight transactions), even for RMW
 - "always_use_lwt" - uses LWT even for non-RMW requests
 - "only_rmw_uses_lwt" - that one's rather self-explanatory

Alternator cooperates with existing --write and --delete parameters.

Aside from being able to check for improvements/regressions
in the alternator module, it's also possible to check how different
isolation levels influence the number of allocations and overall
performance, or to compare alternator against CQL.

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --write --alternator only_rmw_uses_lwt --default-log-level error
random-seed=1235000092
Started alternator executor
10873.76 tps (202.9 allocs/op,  12.4 tasks/op,  369921 insns/op)
11096.09 tps (202.7 allocs/op,  12.1 tasks/op,  374792 insns/op)
11100.09 tps (203.0 allocs/op,  12.1 tasks/op,  376469 insns/op)
11068.98 tps (203.1 allocs/op,  12.1 tasks/op,  377132 insns/op)
11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)

median 11081.24 tps (203.2 allocs/op,  12.1 tasks/op,  377290 insns/op)
median absolute deviation: 14.85
maximum: 11100.09
minimum: 10873.76

$ ./build/release/test/perf/perf_simple_query_g --smp 1 \
    --random-seed 1235000092 --write --alternator always_use_lwt \
    --default-log-level error
random-seed=1235000092
Started alternator executor
3605.35 tps (877.4 allocs/op, 174.6 tasks/op,  986666 insns/op)
3555.71 tps (890.0 allocs/op, 174.4 tasks/op, 1006945 insns/op)
3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
3437.65 tps (908.2 allocs/op, 174.6 tasks/op, 1033992 insns/op)
3409.88 tps (913.2 allocs/op, 174.4 tasks/op, 1041240 insns/op)

median 3530.20 tps (899.7 allocs/op, 174.1 tasks/op, 1021908 insns/op)
median absolute deviation: 75.15
maximum: 3605.35
minimum: 3409.88
2021-05-18 15:10:31 +02:00
Piotr Sarna
6e28c01c53 cdc: make metadata.hh self-sufficient
The header relies on topology_description class definition,
which is part of cdc/generation.hh.
2021-05-18 15:10:31 +02:00
Piotr Sarna
b6d6247a74 test: add minimal alternator_test_env
A minimal implementation of alternator test env, a younger cousin
of cql_test_env, is implemented. Note that using this environment
for unit tests is strongly discouraged in favor of the official
test/alternator pytest suite. Still, alternator_test_env has its uses
for microbenchmarks.
2021-05-18 15:10:31 +02:00
Takuya ASADA
a3b25e3d29 unified/uninstall.sh: simplify uninstall.sh, delete all files correctly
The current uninstall.sh tries to mirror the logic of install.sh,
which makes the script needlessly large, and it also fails to remove
a few files under /opt/scylladb.

Let's just do rm -rf /opt/scylladb, and drop the few other files located
outside of /opt/scylladb.

Closes #8662
2021-05-18 14:55:18 +02:00
Asias He
0858619cba storage_service: Abort restore_replica_count when node is removed from the cluster
Consider the following procedure:

- n1, n2, n3
- n3 is down
- n1 runs `nodetool removenode uuid_of_n3` to remove n3 from the
  cluster
- n1 is down in the middle of removenode operation

Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
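
To get a feel for how long such retries can keep streaming alive, the schedule described above can be tabulated. This is a minimal sketch that assumes one sleep follows each of the 5 failed attempts; the function names are hypothetical and not part of storage_service.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sleep intervals, in seconds, between streaming retries.
// Defaults mirror the description: 5 retries, 60 s initial sleep, x1.5 growth.
inline std::vector<double> retry_sleep_schedule(int max_retries = 5,
                                                double initial_sleep_s = 60.0,
                                                double factor = 1.5) {
    std::vector<double> sleeps;
    double next = initial_sleep_s;
    for (int i = 0; i < max_retries; ++i) {
        sleeps.push_back(next);
        next *= factor;
    }
    return sleeps;
}

// Total time spent sleeping across the whole schedule, in seconds.
inline double total_sleep(const std::vector<double>& sleeps) {
    return std::accumulate(sleeps.begin(), sleeps.end(), 0.0);
}
```

Under these assumptions the sleeps are 60, 90, 135, 202.5 and 303.75 seconds, about 13 minutes in total, on top of the time spent in the streaming attempts themselves.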

This can leave the cluster in a bad state. For example, nodes can run
out of disk space if the streaming continues. We need a way to abort
such streaming attempts.

To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing node to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.

In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.

This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.

Fixes #8651

Closes #8655
2021-05-18 14:55:18 +02:00
Botond Dénes
82bff1bcc6 test: cql_test_env: use proper scheduling groups
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
for as long as the test executable is running.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
2021-05-18 13:44:54 +03:00
Botond Dénes
300ee974f7 test: use with_cql_test_env_thread where needed
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
2021-05-18 13:44:52 +03:00
Avi Kivity
6db826475d Merge "Introduce segregate scrub mode" from Botond
"
The current scrub compaction has a serious drawback. While it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in the way it repairs them: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to lose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view; it
can only do so on the database level. So in certain cases it might be
desirable to have a less heavy-handed approach to cleansing the data,
one that tries as hard as it can not to lose any data.

This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to lose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
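
The partition-level part of this idea can be illustrated with a small sketch. It assumes a first-fit placement (each partition goes to the first output it can extend without breaking order), which is an assumption made for illustration and is not stated in this cover letter; integer keys stand in for real partition keys, and `segregate_by_order` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

using partition_key = int;  // stand-in for a real partition key (assumption)

// Split a possibly out-of-order stream of partition keys into several
// outputs, each of which is strictly increasing and therefore valid.
inline std::vector<std::vector<partition_key>>
segregate_by_order(const std::vector<partition_key>& input) {
    std::vector<std::vector<partition_key>> outputs;
    for (partition_key pk : input) {
        bool placed = false;
        for (auto& out : outputs) {
            // First-fit: append to the first output that stays ordered.
            if (out.back() < pk) {
                out.push_back(pk);
                placed = true;
                break;
            }
        }
        if (!placed) {
            // Out-of-order (or duplicate) partition: open a new output.
            outputs.push_back({pk});
        }
    }
    return outputs;
}
```

With the input `1 2 5 3 4 6 2` this yields three outputs, `1 2 5 6`, `3 4` and `2`, and no partition is dropped.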

The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost, every scrub that does more than just discard data
  that is suspected to be corrupt (and even those, to a certain degree)
  has to consider the possibility that it is rehabilitating corruptions,
  leaving them in the system without a warning, in the sense that the
  user won't see any more problems due to low-level corruptions and
  hence might think everything is alright, while data is still corrupt
  from the business-logic point of view. It is very hard to draw a line
  between what scrub should and shouldn't do, yet there is a demand from
  users for a scrub that can restore data without losing any of it. Note
  that anybody executing such a scrub is already in bad shape: even if
  they can read their data (they often can't), it is already corrupt, so
  scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
  enum, which now selects the scrub mode. This means that
  `skip_corrupted` cannot be combined with segregate to throw out what
  the former can't fix. This was chosen for simplicity: a bunch of
  flags, all interacting with each other, is very hard to see through in
  my opinion, while a linear mode selector is much easier to follow.
* The new segregate mode goes all-in, by trying to fix even
  fragment-level disorder. Maybe it should only do it on the partition
  level, or maybe this should be made configurable, allowing the user to
  select what happens to data that cannot be fixed.

Tests: unit(dev), unit(sstable_datafile_test:debug)
"

* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
  test: boost/sstable_datafile_test: add tests for segregate mode scrub
  api: storage_service/keyspace_scrub: expose new segregate mode
  sstables: compaction/scrub: add segregate mode
  mutation_fragment_stream_validator: add reset methods
  mutation_writer: add segregate_by_partition
  api: /storage_service/keyspace_scrub: add scrub mode param
  sstables: compaction/scrub: replace skip_corrupted with mode enum
  sstables: compaction/scrub: prevent infinite loop when last partition end is missing
  tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
2021-05-18 13:43:01 +03:00
Botond Dénes
5eb4517f56 read_context: move_to_next_partition(): make reader creation atomic
Otherwise an interleaving cache update can clear the `_prev_snapshot`
before the reader is created, leading to the reader being created via a
null mutation source.

Tests: unit(dev, release, debug:row_cache_test)

Fixes #8671.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210518092317.227433-1-bdenes@scylladb.com>
2021-05-18 13:41:48 +03:00
Piotr Sarna
c8653d1321 cql3: enhance the fix for index paging type check
The original fix stripped the reversed type only from
the base table column, but it's better to be safe than sorry,
so the reverse is also stripped from the view column.

Refs #8667
Message-Id: <cb5dedb0b8b6b5eea3a69863ae50a0e906482665.1621330463.git.sarna@scylladb.com>
2021-05-18 12:47:35 +03:00
Takuya ASADA
60c0b37a4c install.sh: apply correct file security context when copying files
Currently, the unified installer does not apply the correct file security
context while copying files, which causes a permission error on
scylla-server.service. We should apply the default file security context
while copying files, using the '-Z' option of /usr/bin/install.

Also, because install -Z requires a normalized path to apply the correct
security context, use 'realpath -m <PATH>' on the path variables in the script.

Fixes #8589

Closes #8602
2021-05-18 12:09:51 +03:00
Takuya ASADA
6faa8b97ec install.sh: fix not such file or directory on nonroot
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.

Fixes #8663

Closes #8664
2021-05-18 12:03:45 +03:00
Avi Kivity
593ad4de1e Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging
2021-05-18 11:34:59 +03:00
Kamil Braun
03ad111beb tree-wide: comments on deprecated functions to access global variables
Closes #8665
2021-05-18 11:31:10 +03:00
Botond Dénes
ae366868fb multishard_mutation_query: save_reader(): avoid round-trip for destroying rparts
Force its destruction when saving the reader.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514140844.119362-1-bdenes@scylladb.com>
2021-05-18 10:07:13 +03:00
Botond Dénes
c98b0d0de8 test: cql_test_env: add trace logs to execute_cql()
In tests executing tons of these, it is useful to be able to enable
trace logging for each one, to see which is the last successful one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514140531.118390-1-bdenes@scylladb.com>
2021-05-18 10:06:22 +03:00
Piotr Sarna
c36f432423 test: add a test case for paging with desc clustering order
Issue #8666 revealed an issue with validating types for paged
indexed queries - namely, the type checking mechanism is too strict
in comparing types and fails on mismatched clustering order -
e.g. an `int` column type is different from `int` with DESC
clustering order. As a result, users see a *very* confusing
message (because reversed types are printed as their underlying type):
 > Mismatched types for base and view columns c: int and int
This test case fails before the fix for #8666 and thus acts
as a regression test.
2021-05-17 17:06:50 +02:00
Piotr Sarna
544ef2caf3 cql3: relax a type check for index paging
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager
- namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra
Fixes #8666
2021-05-17 17:06:50 +02:00
Botond Dénes
dca808dd51 perf/perf_simple_query: add --enable-cache option
Allows testing performance with and without the cache.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210517045402.16153-1-bdenes@scylladb.com>
2021-05-17 14:06:18 +02:00
Raphael S. Carvalho
10ae77966c compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
2021-05-17 13:57:05 +02:00
Avi Kivity
8d6e575f59 perf_fast_forward: report instructions per fragment
Use a hardware counter to report instructions per fragment. Results
vary from ~4k insns/f when reading sequentially to more than 1M insns/f.

Instructions per fragment can be a more stable metric than frags/sec.
It would probably be even more stable with a fake file implementation
that works in-memory to eliminate seastar polling instruction variation.

Closes #8660
2021-05-17 11:33:24 +02:00
Tomasz Grabiec
8dddfab5db Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak
This is the 1st PR in a series with the goal of finishing the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so the current solution is suitable for small tables only. As an example, system.status was added. It was checked that DISTINCT and reverse ORDER BY work.

This PR was created by @jul-stas and @StarostaGit
Fixes #8343

This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)

Closes #8634

* github.com:scylladb/scylla:
  boost/tests: Add virtual_table_test for basic infrastructure
  boost/tests: Test memtable_filling_virtual_table as mutation_source
  db/system_keyspace: Add system.status virtual table
  db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
  db/virtual_table: Introduce memtable_filling_virtual_table
  db: Add virtual tables interface
  db: Introduce chained_delegating_reader
2021-05-17 11:29:37 +02:00
Botond Dénes
5e39cedbe3 evictable_reader: remove _reader_created flag
This flag is not really needed, because we can just attempt a resume on
first use which will fail with the default constructed inactive read
handle and the reader will be created via the recreate-after-evicted
path.
This allows the same path to be used for all reader creation cases,
simplifying the logic and more importantly making further patching
easier without the special case.
To make the recreate path (almost) as cheap for the first reader
creation as it was with the special path, `_trim_range_tombstones` and
`_validate_partition_key` are only set when really needed.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>
2021-05-16 14:45:46 +03:00
Botond Dénes
3b57106627 evictable_reader: remove destructor
We now have close(), which is expected to clean up, so there is no need for
cleanup in the destructor, and consequently no need for a destructor at all.

Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>
2021-05-16 12:19:41 +03:00
Benny Halevy
f4cfa530cc perf: enable instructions_retired_counter only once per executor::run
Enabling it for each run_worker call will invoke ioctl
PERF_EVENT_IOC_ENABLE in parallel to other workers running
and this may skew the results.

Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
2021-05-16 12:13:27 +03:00
Tomasz Grabiec
28ac8d0f2b Merge "raft: randomized_nemesis_test framework" from Kamil
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with.  Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.

The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.

Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks.  For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.
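
The tick-driven behaviour can be sketched without Seastar. The version below is only an illustration: it uses plain callbacks instead of futures, and the class name `logical_timer_sketch` and the method `schedule_after` are hypothetical stand-ins for the real `logical_timer::sleep`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// A logical clock that runs callbacks after a given number of ticks.
// Stand-in for a future-returning sleep(); callbacks replace futures here.
class logical_timer_sketch {
    std::uint64_t _now = 0;
    // Callbacks ordered by the logical time at which they become due.
    std::multimap<std::uint64_t, std::function<void()>> _scheduled;
public:
    std::uint64_t now() const { return _now; }

    // Arrange for `cb` to run once `ticks` further calls to tick() have happened.
    void schedule_after(std::uint64_t ticks, std::function<void()> cb) {
        _scheduled.emplace(_now + ticks, std::move(cb));
    }

    // Advance the clock by one and run every callback that is now due.
    void tick() {
        ++_now;
        while (!_scheduled.empty() && _scheduled.begin()->first <= _now) {
            auto cb = std::move(_scheduled.begin()->second);
            _scheduled.erase(_scheduled.begin());
            cb();
        }
    }
};
```

A callback scheduled with `schedule_after(20, ...)` stays pending for 19 ticks and runs during the 20th, which is the property the tests rely on for timeouts.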

To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
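
The five steps above can be sketched in plain C++. This is a simplified illustration under several assumptions: a shared optional slot stands in for the promise-future pair, there is no Raft or serialization, and `adder`, `impure_machine_sketch` and `allocate_channel` are hypothetical names rather than the test's real API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <unordered_map>
#include <utility>

// Toy pure machine (hypothetical): the state is a running sum and the
// output of each transition is the new sum.
struct adder {
    using state_t = int;
    using input_t = int;
    using output_t = int;
    static constexpr state_t init = 0;
    static std::pair<state_t, output_t> delta(state_t s, input_t i) {
        return {s + i, s + i};
    }
};

template <typename M>
class impure_machine_sketch {
public:
    using channel_id = std::uint64_t;
    // Stand-in for the future side of a promise-future pair.
    using output_slot = std::shared_ptr<std::optional<typename M::output_t>>;
    // Commands are (ID, machine input) pairs, as in step 3.
    struct command { channel_id id; typename M::input_t input; };

    // Steps 1-2: create a channel, remember it under a fresh ID and
    // return the ID together with the reading side.
    std::pair<channel_id, output_slot> allocate_channel() {
        auto slot = std::make_shared<std::optional<typename M::output_t>>();
        channel_id id = _next_id++;
        _channels.emplace(id, slot);
        return {id, slot};
    }

    // Step 4: apply a command; the output is delivered only if this
    // replica knows the channel ID, otherwise it is dropped.
    void apply(const command& cmd) {
        auto [next, out] = M::delta(_state, cmd.input);
        _state = std::move(next);
        if (auto it = _channels.find(cmd.id); it != _channels.end()) {
            *it->second = std::move(out);
            _channels.erase(it);
        }
    }

    const typename M::state_t& state() const { return _state; }
private:
    typename M::state_t _state = M::init;
    channel_id _next_id = 0;  // IDs are unique per instance only in this sketch
    std::unordered_map<channel_id, output_slot> _channels;
};
```

Applying the same command on two instances changes both states, but only the instance that allocated the channel fills the slot, mirroring the note that the set of IDs and channels is not part of the replicated state.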

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.

We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receiving function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.

`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

In addition to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`.  We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.

In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.

`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and a payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
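
A minimal sketch of such a queue follows, assuming integer server IDs, a `send` method that takes a delay in ticks, and a sequence number used to break ties between equal delivery times. All of these names and details are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

template <typename Payload>
class network_sketch {
public:
    struct message { int src; int dst; Payload payload; };
    using deliver_t = std::function<void(const message&)>;

    explicit network_sketch(deliver_t deliver) : _deliver(std::move(deliver)) {}

    // Queue a message with delivery time `now + delay`.
    void send(message m, std::uint64_t delay) {
        _events.push(event{_now + delay, _seq++, std::move(m)});
    }

    // Advance the logical clock, then deliver every message whose
    // delivery time is smaller than the current time.
    void tick() {
        ++_now;
        while (!_events.empty() && _events.top().time < _now) {
            message m = _events.top().msg;
            _events.pop();
            _deliver(m);
        }
    }
private:
    struct event {
        std::uint64_t time;
        std::uint64_t seq;  // tie-breaker: equal times keep send order
        message msg;
    };
    // Orders the priority queue so that the earliest event is on top.
    struct later {
        bool operator()(const event& a, const event& b) const {
            return a.time != b.time ? a.time > b.time : a.seq > b.seq;
        }
    };
    deliver_t _deliver;
    std::uint64_t _now = 0;
    std::uint64_t _seq = 0;
    std::priority_queue<event, std::vector<event>, later> _events;
};
```

Because the comparison is strict, a message sent with delay 2 at time 0 is delivered on the third tick in this sketch. The delivery function itself stays opaque to the queue, as in the description.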

The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.

`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.

`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

`environment` needs to be periodically `tick()`ed, which ticks the network
and the underlying servers.

`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.

* kbr/randomized-nemesis-test-v5:
  raft: randomized_nemesis_test: basic test
  raft: randomized_nemesis_test: ticker
  raft: randomized_nemesis_test: environment
  raft: randomized_nemesis_test: server
  raft: randomized_nemesis_test: delivery queue
  raft: randomized_nemesis_test: network
  raft: randomized_nemesis_test: heartbeat-based failure detector
  raft: randomized_nemesis_test: memory backed persistence
  raft: randomized_nemesis_test: rpc
  raft: randomized_nemesis_test: impure_state_machine
  raft: randomized_nemesis_test: introduce logical_timer
  raft: randomized_nemesis_test: `PureStateMachine` concept
2021-05-14 17:33:40 +02:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Asias He
e4872a78b5 storage_service: Delay update pending ranges for replacing node
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node was changed to postpone
responding to gossip echo messages so that other nodes do not send read
requests to the replacing node. It works as follows:

1) the replacing node does not respond to echo messages, so that other nodes
do not mark it as alive

2) the replacing node advertises the hibernate state, so other nodes know
that it is replacing

3) the replacing node responds to echo messages, so other nodes can mark
it as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimal fix is
implemented in this patch. Once existing nodes learn that the replacing node
is in HIBERNATE state, they record it as replacing, but add it to
the pending list only after the replacing node is
marked as alive.

With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.

Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013

Closes #8614
2021-05-14 17:24:28 +02:00
Tomasz Grabiec
102dcfc1fd Merge "scylla-gdb.py: introduce scylla read-stats" from Botond
Too many or too resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard with their states and
resource consumptions. This is exactly what this new command does. For
this it uses the reader concurrency semaphores and their permits
respectively, which are now arranged in an intrusive list and therefore
are enumerable.
Example output:
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
   permits count       memory table/description/state
         1     1     14279738 multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
        16     0        53532 multishard_mutation_query_test.fuzzy_test/shard-reader/active
         1     0         1144 multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
         1     0            0 *.*/view_builder/active
         1     0            0 multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
        20     1     14334414 Total

* botond/scylla-gdb.py-scylla-reads/v5:
  scylla-gdb.py: introduce scylla read-stats
  scylla-gdb.py: add pretty printer for std::string_view
  scylla-gdb.py: std_map() add __len__()
  scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()
2021-05-14 16:07:14 +02:00
Takuya ASADA
838acb44d0 scylla-fstrim.timer: fix wrong description from 'daily' to 'weekly'
It is scheduled weekly, not daily.

Fixes #8633

Closes #8644
2021-05-14 16:02:12 +02:00
Asias He
b8749f51cb repair: Consider memory bloat when calculate repair parallelism
The repair parallelism is calculated from the amount of memory allocated to
repair and the memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640), which cause repair to
use more memory and can lead to std::bad_alloc.

Be more conservative when calculating the parallelism to avoid repair
using too much memory.

Fixes #8641

Closes #8652
2021-05-14 16:02:08 +02:00
Piotr Sarna
c1cb7d87e1 auth: remove the fixed 15s delay during auth setup
The auth initialization path contains a fixed 15s delay,
which used to work around a couple of issues (#3320, #3850),
but is right now quite useless, because a retry mechanism
is already in place anyway.
This patch speeds up the boot process if authentication is enabled.
In particular, for single-node clusters, common in test setups,
auth initialization now takes a couple of milliseconds instead
of the whole 15 seconds.

Fixes #8648

Closes #8649
2021-05-14 16:01:59 +02:00
Kamil Braun
c21311ecca raft: randomized_nemesis_test: basic test
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
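
For illustration only, an exchange register can be sketched in standard C++ roughly as follows. The name `ex_register_sketch`, the `std::variant` input, the plain `int32_t` output and the initial value 0 are assumptions made for this sketch and are not taken from the test source.

```cpp
#include <cstdint>
#include <utility>
#include <variant>

// Hypothetical sketch of an exchange register as a pure state machine.
struct ex_register_sketch {
    struct exchange { std::int32_t x; };  // replace the state with x
    struct read {};                       // leave the state unchanged

    using state_t = std::int32_t;
    using input_t = std::variant<read, exchange>;
    using output_t = std::int32_t;

    static constexpr state_t init = 0;    // assumed initial value

    // Transition function: returns (next state, output).
    static std::pair<state_t, output_t> delta(state_t s, const input_t& in) {
        if (const auto* e = std::get_if<exchange>(&in)) {
            return {e->x, s};  // exchange: store the new value, return the old one
        }
        return {s, s};         // read: state unchanged, return the current value
    }
};
```

Serializing `input_t` into Raft commands is left out of the sketch.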
2021-05-14 15:11:01 +02:00
Kamil Braun
66b9bc6fe1 raft: randomized_nemesis_test: ticker
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
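
A rough, synchronous sketch of the call-limit behaviour, assuming a hypothetical `ticker_sketch` class. The real `ticker` runs as a Seastar fiber, yields between calls and fails the test when the limit is hit; here `run` simply returns `false` in that case.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical synchronous stand-in for `ticker`.
class ticker_sketch {
public:
    // Request that the ticker stops before its next call.
    void abort() { _aborted = true; }

    // Calls `fn` repeatedly until abort() is requested. Returns false if
    // `limit` calls were made without an abort.
    bool run(std::uint64_t limit, const std::function<void()>& fn) {
        assert(static_cast<bool>(fn));
        for (std::uint64_t i = 0; i < limit; ++i) {
            if (_aborted) {
                return true;
            }
            fn();
        }
        return _aborted;
    }

private:
    bool _aborted = false;
};
```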
2021-05-14 15:11:01 +02:00
Kamil Braun
c7cef58797 raft: randomized_nemesis_test: environment
`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

The environment needs to be periodically `tick()`ed, which ticks the network
and the underlying servers.

New servers can be created in the environment by calling `new_server`.
2021-05-14 15:11:01 +02:00
Kamil Braun
5095a4158e raft: randomized_nemesis_test: server
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
2021-05-14 15:11:01 +02:00
Kamil Braun
f139fd4c28 raft: randomized_nemesis_test: delivery queue
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
2021-05-14 15:11:01 +02:00
Kamil Braun
2956f5f76c raft: randomized_nemesis_test: network
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
2021-05-14 15:11:01 +02:00
Kamil Braun
3068a0aa70 raft: randomized_nemesis_test: heartbeat-based failure detector
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
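
The conviction rule can be sketched as follows. `failure_detector_sketch`, the tick-based threshold and sending one heartbeat to every known server on each tick are assumptions of this sketch; the real implementation may schedule heartbeats differently.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical heartbeat-based failure detector driven by logical ticks.
class failure_detector_sketch {
public:
    using send_heartbeat_t = std::function<void(int dst)>;

    failure_detector_sketch(std::uint64_t convict_threshold, send_heartbeat_t send)
        : _threshold(convict_threshold), _send(std::move(send)) {
        assert(_threshold > 0);
        assert(static_cast<bool>(_send));
    }

    void add_server(int id) { _last_heard.emplace(id, _now); }
    void remove_server(int id) { _last_heard.erase(id); }

    // Record that something was heard from `id` at the current time.
    void receive_heartbeat(int id) {
        auto it = _last_heard.find(id);
        if (it != _last_heard.end()) {
            it->second = _now;
        }
    }

    // Advance logical time and send a heartbeat to every known server.
    void tick() {
        ++_now;
        for (const auto& entry : _last_heard) {
            _send(entry.first);
        }
    }

    // A known server is alive if it was heard from fewer than
    // `convict_threshold` ticks ago; unknown servers are never alive.
    bool is_alive(int id) const {
        auto it = _last_heard.find(id);
        return it != _last_heard.end() && _now - it->second < _threshold;
    }

private:
    std::uint64_t _threshold;
    send_heartbeat_t _send;
    std::uint64_t _now = 0;
    std::map<int, std::uint64_t> _last_heard;
};
```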
2021-05-14 15:11:01 +02:00
Kamil Braun
51df600478 raft: randomized_nemesis_test: memory backed persistence
`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

In addition to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit.  We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
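
The two invariants can be sketched as below. `persistence_sketch`, `entry_sketch`, the snapshot represented only by its index, and dropping every entry covered by a stored snapshot are simplifications assumed for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

struct entry_sketch {
    std::uint64_t idx;
    std::string cmd;
};

// Hypothetical in-memory persistence that keeps its log contiguous.
class persistence_sketch {
public:
    // The caller must pass strictly increasing, gap-free indexes.
    void store_log_entries(const std::vector<entry_sketch>& entries) {
        for (const auto& e : entries) {
            assert(_stored.empty() || e.idx == _stored.back().idx + 1);
            _stored.push_back(e);
        }
    }

    // Drop every stored entry whose index is >= idx.
    void truncate_log(std::uint64_t idx) {
        while (!_stored.empty() && _stored.back().idx >= idx) {
            _stored.pop_back();
        }
    }

    // Remember the snapshot index and drop the entries it covers.
    void store_snapshot(std::uint64_t snap_idx) {
        _snapshot_idx = snap_idx;
        while (!_stored.empty() && _stored.front().idx <= snap_idx) {
            _stored.pop_front();
        }
    }

    // Contiguity, plus: a non-empty log touches the snapshot on its right
    // side (first index == snapshot index + 1) or intersects it.
    bool invariant_holds() const {
        for (std::size_t i = 1; i < _stored.size(); ++i) {
            if (_stored[i].idx != _stored[i - 1].idx + 1) {
                return false;
            }
        }
        return _stored.empty() || _stored.front().idx <= _snapshot_idx + 1;
    }

    std::size_t size() const { return _stored.size(); }

private:
    std::deque<entry_sketch> _stored;
    std::uint64_t _snapshot_idx = 0;
};
```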
2021-05-14 15:11:01 +02:00
Kamil Braun
7a1f6e6d7b raft: randomized_nemesis_test: rpc
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message-receiving
function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
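
The ID-based request/reply matching for `send_snapshot` can be sketched as below. All names here are hypothetical, the snapshot is reduced to an `int`, and a shared `std::optional` slot stands in for the Seastar promise-future pair so that the sketch needs only the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <utility>

struct snapshot_msg { std::uint64_t reply_id; int snapshot; };
struct snapshot_reply_msg { std::uint64_t reply_id; bool success; };

// Stand-in for a promise-future pair: the sender keeps one reference to
// read the reply, the pending map keeps another to write it.
using reply_slot = std::shared_ptr<std::optional<bool>>;

class rpc_sketch {
public:
    using send_snapshot_t = std::function<void(int dst, snapshot_msg)>;
    using send_reply_t = std::function<void(int dst, snapshot_reply_msg)>;

    rpc_sketch(send_snapshot_t ss, send_reply_t sr)
        : _send_snapshot(std::move(ss)), _send_reply(std::move(sr)) {}

    // Allocate a slot under a fresh ID, then send the snapshot with that ID.
    reply_slot send_snapshot(int dst, int snapshot) {
        const std::uint64_t id = _next_id++;
        auto slot = std::make_shared<std::optional<bool>>();
        const bool inserted = _pending.emplace(id, slot).second;
        assert(inserted);
        (void)inserted;
        _send_snapshot(dst, snapshot_msg{id, snapshot});
        return slot;
    }

    // Receiving side: apply the snapshot and reply with the same ID.
    void receive(int src, const snapshot_msg& m) {
        _applied = m.snapshot;
        _send_reply(src, snapshot_reply_msg{m.reply_id, true});
    }

    // Sending side: look up the ID; unknown or duplicate replies are ignored.
    void receive(int /*src*/, const snapshot_reply_msg& m) {
        auto it = _pending.find(m.reply_id);
        if (it == _pending.end()) {
            return;
        }
        *it->second = m.success;
        _pending.erase(it);
    }

    std::optional<int> applied_snapshot() const { return _applied; }

private:
    send_snapshot_t _send_snapshot;
    send_reply_t _send_reply;
    std::uint64_t _next_id = 0;
    std::map<std::uint64_t, reply_slot> _pending;
    std::optional<int> _applied;
};
```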
2021-05-14 15:11:01 +02:00
Kamil Braun
905126acc3 raft: randomized_nemesis_test: impure_state_machine
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
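
The output-channel technique can be sketched as follows. `impure_sketch`, the `adder` machine and `allocate_channel` are hypothetical, and a shared `std::optional` slot stands in for the promise-future pair. Timeouts, leader checks and command serialization are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical pure machine: adds the input, outputs the previous state.
struct adder {
    static std::pair<int, int> delta(int state, int input) {
        return {state + input, state};
    }
};

using channel = std::shared_ptr<std::optional<int>>;

// Commands are (ID, machine input) pairs.
struct command { std::uint64_t id; int input; };

class impure_sketch {
public:
    // Steps 1-2: create a channel, store its writing side under a fresh ID
    // and hand the ID and the reading side to the requester.
    std::pair<std::uint64_t, channel> allocate_channel() {
        const std::uint64_t id = _next_id++;
        auto ch = std::make_shared<std::optional<int>>();
        const bool inserted = _channels.emplace(id, ch).second;
        assert(inserted);
        (void)inserted;
        return {id, ch};
    }

    // Step 4: apply commands in order; send an output only if a channel
    // with the command's ID exists on this replica.
    void apply(const std::vector<command>& cmds) {
        for (const auto& c : cmds) {
            auto [next, out] = adder::delta(_state, c.input);
            _state = next;
            auto it = _channels.find(c.id);
            if (it != _channels.end()) {
                *it->second = out;
                _channels.erase(it);
            }
        }
    }

    int state() const { return _state; }

private:
    int _state = 0;
    std::uint64_t _next_id = 0;
    std::map<std::uint64_t, channel> _channels;  // not replicated state
};
```

Applying the same commands on a replica that never allocated the channel changes its state identically but produces no output, which matches the note above that only the leader attempts to send it.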
2021-05-14 15:11:01 +02:00
Kamil Braun
3e02befccd raft: randomized_nemesis_test: introduce logical_timer
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
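
A callback-based sketch of the same idea, assuming a hypothetical `logical_timer_sketch` class. The real `logical_timer` wraps `raft::logical_clock` and returns Seastar futures rather than taking callbacks.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical timer driven purely by explicit tick() calls.
class logical_timer_sketch {
public:
    // Run `cb` during the tick() call that is `ticks` calls from now.
    // With ticks == 0, `cb` runs on the next tick().
    void schedule(std::uint64_t ticks, std::function<void()> cb) {
        assert(static_cast<bool>(cb));
        _timers.emplace(_now + ticks, std::move(cb));
    }

    // Advance the clock by one and fire every timer that is due.
    void tick() {
        ++_now;
        while (!_timers.empty() && _timers.begin()->first <= _now) {
            auto cb = std::move(_timers.begin()->second);
            _timers.erase(_timers.begin());
            cb();
        }
    }

    std::uint64_t now() const { return _now; }

private:
    std::uint64_t _now = 0;
    // Ordered by expiry time; equal times fire in insertion order.
    std::multimap<std::uint64_t, std::function<void()>> _timers;
};
```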
2021-05-13 11:34:00 +02:00
Kamil Braun
15e3bd2620 raft: randomized_nemesis_test: PureStateMachine concept
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.

The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.

One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such modification,
if we decide to do it, is trivial.
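
To illustrate the definition, here is a sketch of a machine with the listed members and of generic code written against them. `running_sum` and `run_machine` are hypothetical, and the sketch uses a plain template instead of a C++ concept so that it also compiles as C++17.

```cpp
#include <utility>
#include <vector>

// Hypothetical machine: the state is the sum of all inputs seen so far,
// and each output is the new sum.
struct running_sum {
    using state_t = long;
    using input_t = int;
    using output_t = long;

    static constexpr state_t init = 0;

    static std::pair<state_t, output_t> delta(state_t s, input_t i) {
        return {s + i, s + i};
    }
};

// Generic driver: starts from M::init, feeds the inputs through M::delta
// one by one and collects the outputs.
template <typename M>
std::vector<typename M::output_t> run_machine(
        const std::vector<typename M::input_t>& inputs) {
    typename M::state_t state = M::init;
    std::vector<typename M::output_t> outputs;
    for (const auto& in : inputs) {
        auto result = M::delta(state, in);
        state = std::move(result.first);
        outputs.push_back(std::move(result.second));
    }
    return outputs;
}
```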
2021-05-13 11:34:00 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value() and test optional directly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Piotr Wojtczak
e6254acfd3 boost/tests: Add virtual_table_test for basic infrastructure
2021-05-12 17:05:35 +02:00
Piotr Wojtczak
8825ae128d boost/tests: Test memtable_filling_virtual_table as mutation_source
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
2021-05-12 17:05:35 +02:00