Commit Graph

908 Commits

Author SHA1 Message Date
Botond Dénes
d1619eb38a Merge 'Remove qctx from helpers that retrieve truncation record' from Pavel Emelyanov
There are two places that do it -- commitlog and batchlog replayers. Both can have local system-keyspace reference and use system-keyspace local query-processor for it. The peering save_truncation_record() is not that simple and is not patched by this PR

Closes #13087

* github.com:scylladb/scylladb:
  system_keyspace: Unstatic get_truncation_record()
  system_keyspace: Unstatic get_truncated_at()
  batchlog_manager: Add system_keyspace dependency
  main: Swap batchlog manager and system keyspace starts
  system_keyspace: Unstatic get_truncated_position()
  system_keyspace: Remove unused method
  commitlog: Create commitlog_replayer with system keyspace
  test: Make cql_test_env::get_system_keyspace() return sharded
  commiltlog: Line-up field definitions
2023-03-07 10:19:55 +02:00
Kefu Chai
020483aa59 Update seastar submodule and main
this change also includes change to main, to make this commit compile.
see below:

* seastar 9b6e181e42...9cbc1fe889 (46):
  > Merge 'Make io-tester jobs share sched classes' from Pavel Emelyanov
  > io_tester.md: Update the `rps` configuration option description
  > io_tester: Add option to limit total number of requests sent
  > Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov
  > io_tester: Add option to share classes between jobs
  > rpc: Abort connection if send_entry() fails
  > Merge 'build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS' from Kefu Chai
  > build: cooking.sh: use the same BUILD_SHARED_LIBS when building ingredients
  > build: cooking.sh: use the same generator when building ingredients
  > core/memory: handle `strerror_r` returning static string
  > Merge 'build, rpc: lz4 related cleanups' from Kefu Chai
  > build, rpc: do not support lz4 < 1.7.3
  > build: set the correct version when finding lz4
  > build: include CheckSymbolExists
  > rpc: do not include lz4.h in header
  > build: set CMP0135 for Cooking.cmake
  > docs: drop building-*.md
  > Merge 'seastar-addr2line: cleanups' from Kefu Chai
  > seastar-addr2line: refactor tests using unittest
  > seastar-addr2line: extract do_test() and main()
  > seastar-addr2line: do not import unused modules
  > scheduling: add a `rename` callback to scheduling_group_key_config
  > reactor: syscall thread: wakeup up reactor with finer granularity
  > build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS
  > build: extract dpdk_extra_cflags out
  > core/sstring: remove a temporary variable
  > Merge 'treewide: include what we use, and add a checkheaders target' from Kefu Chai
  > perftune.py: auto-select the same number of IRQ cores on each NUMA
  > prometheus: remove unused headers
  > core/sstring: define <=> operator for sstring
  > Merge 'core: s/reserve_additional_memory/reserve_additional_memory_per_shard/' from Kefu Chai
  > include: do not include <concepts> directly
  > coding_style: note on self-contained header requirement
  > circileci: build checkheaders in addition to default target
  > build: add checkheaders target
  > net/toeplitz: s/u_int/unsigned/
  > net/tcp-stack: add forward declaration for seastar::socket
  > core, net, util: include used headers

* main: set reserved memory for wasm on per-shard basis

  this change is a follow-up of
  f05d612da8 and
  4a0134a097.

  this change depends on the related change in Seastar to reserve
  additional memory on a per-shard basis.

  per Wojciech Mitros's comment:

  > it should have probably been 50MB per shard

  in other words, as we always execute the same set of udf on all
  shards. and since one cannot predict the number of shards, but she
  could have a rough estimation on the size of memory a regular (set
  of) udf could use. so a per-shard setting makes more sense.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-06 18:41:34 +02:00
Pavel Emelyanov
1907518034 batchlog_manager: Add system_keyspace dependency
The manager will need system ks to get truncation record from, so add it
explicitly. Start-stop sequence no allows that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-03-06 13:28:40 +03:00
Pavel Emelyanov
40b762b841 main: Swap batchlog manager and system keyspace starts
The former needs the latter to get truncation records from and will thus
need it as explicit dependency. In order to have it bathlog needs to
start after system ks. This works as starting batchlog manager doesn't
do anything that's required by system keyspace. This is indirectly
proven by cql-test-env in which batchlog manager starts later than it
does in main

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-03-06 13:28:40 +03:00
Pavel Emelyanov
47b61389b5 commitlog: Create commitlog_replayer with system keyspace
The replayer code needs system keyspace to fetch truncation records
from, thus it needs this explicit dependency. By the time it runs system
keyspace is fully initialized already

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-03-06 13:28:36 +03:00
Botond Dénes
1c0b47ee9b Merge 'treewide: remove unused variable and reference used one explicitly' from Kefu Chai
- treewide: do not define/capture unused variables
- sstables/sstables: mark dummy variable for loop [[maybe_unused]]
- util/result_try: reference this explicitly
- raft: reference this explicitly
- idl-compiler: mark captured this used
- build: reenable unused-{variable,lambda-capture} warnings

Closes #12915

* github.com:scylladb/scylladb:
  build: reenable unused-{variable,lambda-capture} warnings
  test: reader_concurrency_semaphore_test: define target_memory in debug mode
  api::failure_detector: mark set_phi_convict_threshold unimplemented
  test: memtable_test: mark dummy variable for loop [[maybe_unused]]
  idl-compiler: mark captured this used
  raft: reference this explicitly
  util/result_try: reference this explicitly
  sstables/sstables: mark dummy variable for loop [[maybe_unused]]
  treewide: do not define/capture unused variables
  service: storage_service: clear _node_ops in batch
2023-03-01 09:44:37 +02:00
Kefu Chai
3ae11de204 treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 21:56:53 +08:00
Kefu Chai
7550be1fc6 main: move get_tools() into main()
there is not need to have a dedicated function which is only consumed
by `main()`. so let's move the body of `get_tools()` into `main`. and
with this change, a plain C array would suffice. so just use a plain
array for tools.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 17:09:46 +08:00
Kefu Chai
128dbebb76 main: add missing descriptions for tools
Fixes #13026
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 17:09:46 +08:00
Kefu Chai
ef0dfeb2fa main: track tools descriptin in tool struct
so we can manage the tools in a more structured way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 17:09:46 +08:00
Kefu Chai
ffbbd59486 main: use a struct for representing tool
so we can encapsulate the description of a certain tool in this
struct with a more readable field name in comparison with a tuple<>,
if we want to track all tools in this vector.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 17:09:46 +08:00
Kefu Chai
73cf62469b main: expose tools as a vector<>
so, in addition to looking up a tool by the name in it, we will be
able to list all tools in this vector. this change paves the road to
a more general solution to handle `--list-tools`.

in this change

* `lookup_main_func()` is replaced by `get_tools()`.
* instead of checking `main_func` out of the if block,
  check it in the `if` block. as we already know if we have a matched
  tool in the `if` block, and we can early return right there.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 17:09:46 +08:00
Aleksandra Martyniuk
5d826f13e7 api: move get_and_update_ttl to task manager api
Task ttl can be set with task manager test api, which is disabled
in release mode.

Move get_and_update_ttl from task manager test api to task manager
api, so that it can be used in release mode.

Closes #12894
2023-02-17 10:19:06 +02:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Kefu Chai
d4315245a1 main: use defer_verbose_shutdown() to shutdown compaction manager
* use `defer_verbose_shutdown()` to shutdown compaction manager

`EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them
are caused by environmental issues.

before this change, `compaction_manager` filters the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they are not propagated to caller when calling
`compaction_manager::stop()` -- only a warning message is printed
in the log. but `EDQUOT` is not handled.

after this change, the exception raised by compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of exception, and print out error message in the log. so
the `ENOSPC` and `EDQUOT` errors are taken care of, and more
visible from user's perspective as they are printed as errors
instead of warning. but they are not printed using the
`compaction_manager` logger anymore. so if our testing or user's
workflow depends on this behavior, the related setting should be
updated accordingly.

Fixes #12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-07 16:00:40 +08:00
Kefu Chai
c3ef353e3d main: consider EDQUOT as environmental failure also
EDQUOT can be returned as the errno when the underlying filesystem
is trying to reserve necessary resources from disk for performing
i/o on behalf of the effective user, and the filesystem fails to
acquire the necessary resources. it could be inode, volume space,
or whatever resources for completing the i/o operation. but none
of them is the consequence of scylla's fault. so we should not
`abort()` at seeing this errno. instead, it's should be reported
to the administrator.

in this change, EDQUOT is also considered as an environmental
failure just like EIO, EACCES and ENOSPC. they could be thrown
when stopping an server.

Fixes #12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-07 16:00:40 +08:00
Botond Dénes
8efa9b0904 Merge 'Avoid qctx from view-builder methods of system_keyspace' from Pavel Emelyanov
The system_keyspace defines several auxiliary methods to help view_builder update system.scylla_views_builds_in_progress and system.built_views tables. All use global qctx thing.

It only takes adding view_builder -> system_keyspace dependency in order to de-static all those helpers and let them use query-processor from it, not the qctx.

Closes #12728

* github.com:scylladb/scylladb:
  system_keysace: De-static calls that update view-building tables
  storage_service: Coroutinize mark_existing_views_as_built()
  api: Unset column_famliy endpoints
  api: Carry sharded<db::system_keyspace> reference over
  view_builder: Add system_keyspace dependency
2023-02-06 12:44:40 +02:00
Pavel Emelyanov
b347a0cf0b api: Unset column_famliy endpoints
The API calls in question will use system keyspace, that starts before
(and thus stops after) and nowadays indirectly uses database instance
that also starts earlier (and also stops later), so this avoids
potential dangling references.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-02-03 18:59:28 +03:00
Pavel Emelyanov
eac2e453f2 api: Carry sharded<db::system_keyspace> reference over
There's the column_family/get_built_indexes call that calls a system
keyspace method to fetch data from scylla_views_builds_in_progress
table, so the system keyspace reference will be needed in the API
handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-02-03 18:57:43 +03:00
Pavel Emelyanov
bbbeba103b view_builder: Add system_keyspace dependency
The view builder updates system.scylla_views_builds_in_progress and
.built_views tables and thus needs the system keyspace instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-02-03 18:55:58 +03:00
Aleksandra Martyniuk
47ef689077 compaction: create and register task manager's module for compaction
As an initial part of integration of compaction with task manager, compaction
module is added. Compaction module inherits from tasks::task_manager::module
and shared_ptr to it is kept in compaction manager. No compaction tasks are
created yet.
2023-02-03 13:52:30 +01:00
Kefu Chai
4a0134a097 db: system_keyspace: take the reserved_memory into account
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.

after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.

Fixes #12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12573
2023-01-24 14:07:44 +02:00
Kefu Chai
7f5bb19d1f main: move perf_sstable into scylla
* configure.py:
  - include `test/perf/perf_sstable` and its dependencies in scylla_perfs
* test/perf/perf_sstable.cc: change `main()` to
  `perf::scylla_sstable_main()`
* test/perf/entry_point.hh: add
  `perf::scylla_sstable_main()`
* main.cc:
  - dispatch "perf-sstable" subcommand to
    `perf::scylla_sstable_main`

before this change, we have a tool at `test/perf/perf_sstable`
for running performance tests by exercising sstable related operations.

after this change, the `test/perf/perf_sstable` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-sstable`
[options, ...]` to perform the same tests previous driven by the tool.

Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-19 17:42:52 +08:00
Kefu Chai
240f2c6f00 main: move perf_row_cache_update into scylla
* configure.py:
  - include `test/perf/perf_row_cache_update.cc` in scylla_perfs
* main.cc:
  - dispatch "perf-row-cache-update" subcommand to
    `perf::scylla_row_cache_update_main`
* test/perf/perf_fast_forward.cc: change `main()` to
  `perf::scylla_row_cache_update_main()`
* test/perf/entry_point.hh: add
  `perf::scylla_row_cache_update_main()`

before this change, we have a tool at `test/perf/perf_row_cache_update`
for running performance tests by updating row cache.

after this change, the `test/perf/perf_row_cache_update` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-row-cache-update
[options, ...]` to perform the same tests previous driven by the tool.

Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-19 17:42:46 +08:00
Kefu Chai
228ccdc1c7 main: move perf_fast_forward into scylla
* configure.py:
  - include `test/perf/perf_simple_query.cc` in scylla_perfs
* main.cc:
  - dispatch "perf-fast-forward" subcommand to
    `perf::scylla_fast_forward_main`
* test/perf/perf_fast_forward.cc: change `main()` to
  `perf::scylla_simple_query_main()`
* test/perf/entry_point.hh: add
  `perf::scylla_simple_query_main()`

before this change, we have a tool at `test/perf/perf_fast_forward`
for running performance tests by fast forwarding the reader.

after this change, the `test/perf/perf_fast_forward` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-fast-forward
[options, ...]` to perform the same tests previous driven by the tool.

Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-19 17:42:40 +08:00
Kefu Chai
09de031cab main: move perf_simple_query into scylla
* configure.py:
  - include scylla_perfs in scylla
  - move 'test/lib/debug.cc' down scylla_perfs, as the latter uses
    `debug::the_database`
  - link `scylla` against seastar_testing_libs also. because we
    use the helpers in `test/lib/random_utils.hh` for generating
    random numbers / sequences in `perf_simple_query.cc`, and
    `random_utils.hh` references `seastar::testing::local_random_engine`
    as a local RNG. but `seastar::testing::local_random_engine`
    is included in `libseastar_testing.a` or
    `libseastar_perf_testing.a`. since we already have the rules for
    linking against `libseastar_testing.a`, let's just reuse them,
    and link `scylla` against this new dependency.

* main.cc:
  - dispatch "perf-simple-query" subcommand to
    `perf::scylla_simple_query_main`
* test/perf/perf_simple_query.cc: change `main()` to
  `perf::scylla_simple_query_main()`
* test/perf/entry_point.hh: define the main function entries
  so `main.cc` can find them. it's quite like how we collect
  the entries in `tools/entry_point.hh`

before this change, we have a tool at `test/perf/perf_simple_query`
for running performance test by sending simple query to a single-node
cluster.

after this change, the `test/perf/perf_simple_query` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-simple-query
[options, ...]` to perform the same tests previous driven by the tool.

Fixes #12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-19 17:42:30 +08:00
Kefu Chai
c65692a13a test: extract debug::the_database out
we want to integrate some perf test into scylla executable, so we
can run them on a regular basis. but `test/lib/cql_test_env.cc`
shares `debug::the_database` with `main.cc`, so we cannot just
compile them into a single binary without changing them.

before this change, both `test/lib/cql_test_env.cc`
and `main.cc` define `debug::the_database`.

after this change, `debug::the_database` is extracted into
`debug.cc`, so it compiles into a separate compiling unit.
and scylla and tests using seastar testing framework are linked
against `debug.cc` via `scylla_core` respectively. this paves the road to
integrating scylla with the tests linking aginst
`test/lib/cql_test_env.cc`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-19 17:42:23 +08:00
Kefu Chai
965443d6be main: shift the args when checking exec_name
instead of introducing yet another variable for tracking the
status, update the args right away. for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-18 22:22:10 +08:00
Kefu Chai
835cd9bfc9 main: extract lookup_main_func() out
refactor main() to extract lookup_main_func() out, so we find
the main_func in a table instead of using a lengthy if-then-else
clause.

when the length of the list of candidates of dispatch grows, the
code would be less structured. so in this change, the code looking
up for the main_func is extracted into a dedicated function for
better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-01-18 22:22:10 +08:00
Kamil Braun
a483915c62 db: system_keyspace: add a virtual table with raft configuration
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.

Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.

Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
2023-01-17 12:28:00 +01:00
Kefu Chai
114f30016a main: use std::shift_left() to consume tool name
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12536
2023-01-16 21:01:34 +02:00
Takuya ASADA
548c9e36a1 main: add tcp_timestamps sanity check
Check net.ipv4.tcp_timestamps, show warning message when it's not set to 1.

Fixes #12144

Closes #12199
2023-01-09 19:08:21 +02:00
Wojciech Mitros
f05d612da8 wasm: limit memory allocated using mmap
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory do the WASM programs use and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually acheved by eviction of all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
2023-01-06 14:07:29 +01:00
Wojciech Mitros
b8d28a95bf wasm: add configuration options for instance cache and udf execution
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their cache of wasm,
the maximum size of indivitual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (for the control of
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop) and the hard limit of the size of UDFs
that are executed (to avoid large allocations)
2023-01-06 14:07:27 +01:00
Kamil Braun
09da661eeb Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.

* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
  raft: replace experimental raft option with dedicated flag
  main: move supervisor notification about group registry start where it actually starts
2023-01-05 15:21:35 +01:00
Petr Gusev
8417840647 raft: raft_group0, register RPC verbs on all shards
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added
to identify the problem.

Fixes: #12252
2023-01-03 17:04:07 +03:00
Gleb Natapov
1688163233 raft: replace experimental raft option with dedicated flag
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
2023-01-03 11:15:11 +02:00
Gleb Natapov
29060cc235 main: move supervisor notification about group registry start where it actually starts
99fe580068 moved raft_group_registry::start call a bit later, but
forget to move supervisor notification call. Do it now.
2023-01-03 11:09:30 +02:00
Kamil Braun
f3243ff674 main: use Host ID as Raft ID
The Host ID now uniquely identifies a node (we no longer steal it during
node replace) and Raft is still experimental. We can reuse the Host ID
of a node as its Raft ID. This will allow us to remove and simplify a
lot of code.

With this we can already remove some dead code in this commit.
2022-12-12 15:14:51 +01:00
Pavel Emelyanov
be8512d7cc sstables, code: Wrap directory semaphore with concurrency
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelizm.

This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-05 11:59:30 +03:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Kamil Braun
99fe580068 service/raft: make this node's Raft ID available early in group registry
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.

Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.

We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.

This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
2022-12-01 20:54:18 +01:00
Konstantin Osipov
73e5298273 raft: (address map) actively maintain ip <-> raft server id map
1) make address map API flexible

Before this patch:
- having a mapping without an actual IP address was an
  internal error
- not having a mapping for an IP address was an internal
  error
- re-mapping to a new IP address wasn't allowed

After this patch:

- the address map may contain a mapping
  without an actual IP address, and the caller must be prepared for it:
  find() will return a nullopt. This happens when we first add an entry
  to Raft configuration and only later learn its IP address, e.g.  via
  gossip.

- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications

Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.

3) prompt address map state with app state

Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).

The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.

Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.

Thanks to the changes above, Raft configuration no longer
stores IP addresses.

We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
2022-11-29 19:55:43 +03:00
Nadav Har'El
2dedb5ea75 alternator: make Alternator TTL feature no longer "experimental"
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.

This patch removes this requirement and in essence GAs this feature.

Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.

This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.

Fixes #12037

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12049
2022-11-24 17:21:39 +02:00
Kamil Braun
e086521c1a direct_failure_detector: get rid of complex endpoint_id translations
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.

Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.

In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.

In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
2022-11-04 09:38:08 +01:00
Kamil Braun
ac70a05c7e service/raft: store raft_address_map reference in direct_fd_pinger
The pinger will use the map to translate `raft::server_id`s to
`gms::inet_address`es when pinging.
2022-11-04 09:38:08 +01:00
Kamil Braun
2c20f2ab9d gms: gossiper: move direct_fd_pinger out to a separate service
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
2022-11-04 09:38:08 +01:00
Pavel Emelyanov
efbfcdb97e Merge 'Replicate raft_address_map non-expiring entries to other shards' from Kamil Braun
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.

Closes #11791

* github.com:scylladb/scylladb:
  test/raft: raft_address_map_test: add replication test
  service/raft: raft_address_map: replicate non-expiring entries to other shards
  service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
  service/raft: turn raft_address_map into a service
2022-11-03 18:34:42 +03:00
Aleksandra Martyniuk
dc80af33bc repair: add task_manager::module to repair_service
repair_service keeps a shared pointer to repair_module.
2022-10-31 10:04:50 +01:00
Kamil Braun
159bb32309 service/raft: turn raft_address_map into a service 2022-10-31 09:17:10 +01:00