There are two places that do it -- commitlog and batchlog replayers. Both can have local system-keyspace reference and use system-keyspace local query-processor for it. The peering save_truncation_record() is not that simple and is not patched by this PR
Closes#13087
* github.com:scylladb/scylladb:
system_keyspace: Unstatic get_truncation_record()
system_keyspace: Unstatic get_truncated_at()
batchlog_manager: Add system_keyspace dependency
main: Swap batchlog manager and system keyspace starts
system_keyspace: Unstatic get_truncated_position()
system_keyspace: Remove unused method
commitlog: Create commitlog_replayer with system keyspace
test: Make cql_test_env::get_system_keyspace() return sharded
commiltlog: Line-up field definitions
this change also includes change to main, to make this commit compile.
see below:
* seastar 9b6e181e42...9cbc1fe889 (46):
> Merge 'Make io-tester jobs share sched classes' from Pavel Emelyanov
> io_tester.md: Update the `rps` configuration option description
> io_tester: Add option to limit total number of requests sent
> Merge 'Keep outgoing queue all cancellable while negotiating (again)' from Pavel Emelyanov
> io_tester: Add option to share classes between jobs
> rpc: Abort connection if send_entry() fails
> Merge 'build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS' from Kefu Chai
> build: cooking.sh: use the same BUILD_SHARED_LIBS when building ingredients
> build: cooking.sh: use the same generator when building ingredients
> core/memory: handle `strerror_r` returning static string
> Merge 'build, rpc: lz4 related cleanups' from Kefu Chai
> build, rpc: do not support lz4 < 1.7.3
> build: set the correct version when finding lz4
> build: include CheckSymbolExists
> rpc: do not include lz4.h in header
> build: set CMP0135 for Cooking.cmake
> docs: drop building-*.md
> Merge 'seastar-addr2line: cleanups' from Kefu Chai
> seastar-addr2line: refactor tests using unittest
> seastar-addr2line: extract do_test() and main()
> seastar-addr2line: do not import unused modules
> scheduling: add a `rename` callback to scheduling_group_key_config
> reactor: syscall thread: wakeup up reactor with finer granularity
> build: build dpdk with `-fPIC` if BUILD_SHARED_LIBS
> build: extract dpdk_extra_cflags out
> core/sstring: remove a temporary variable
> Merge 'treewide: include what we use, and add a checkheaders target' from Kefu Chai
> perftune.py: auto-select the same number of IRQ cores on each NUMA
> prometheus: remove unused headers
> core/sstring: define <=> operator for sstring
> Merge 'core: s/reserve_additional_memory/reserve_additional_memory_per_shard/' from Kefu Chai
> include: do not include <concepts> directly
> coding_style: note on self-contained header requirement
> circileci: build checkheaders in addition to default target
> build: add checkheaders target
> net/toeplitz: s/u_int/unsigned/
> net/tcp-stack: add forward declaration for seastar::socket
> core, net, util: include used headers
* main: set reserved memory for wasm on per-shard basis
this change is a follow-up of
f05d612da8 and
4a0134a097.
this change depends on the related change in Seastar to reserve
additional memory on a per-shard basis.
per Wojciech Mitros's comment:
> it should have probably been 50MB per shard
in other words, as we always execute the same set of udf on all
shards. and since one cannot predict the number of shards, but she
could have a rough estimation on the size of memory a regular (set
of) udf could use. so a per-shard setting makes more sense.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The manager will need system ks to get truncation record from, so add it
explicitly. Start-stop sequence no allows that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former needs the latter to get truncation records from and will thus
need it as explicit dependency. In order to have it bathlog needs to
start after system ks. This works as starting batchlog manager doesn't
do anything that's required by system keyspace. This is indirectly
proven by cql-test-env in which batchlog manager starts later than it
does in main
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The replayer code needs system keyspace to fetch truncation records
from, thus it needs this explicit dependency. By the time it runs system
keyspace is fully initialized already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- treewide: do not define/capture unused variables
- sstables/sstables: mark dummy variable for loop [[maybe_unused]]
- util/result_try: reference this explicitly
- raft: reference this explicitly
- idl-compiler: mark captured this used
- build: reenable unused-{variable,lambda-capture} warnings
Closes#12915
* github.com:scylladb/scylladb:
build: reenable unused-{variable,lambda-capture} warnings
test: reader_concurrency_semaphore_test: define target_memory in debug mode
api::failure_detector: mark set_phi_convict_threshold unimplemented
test: memtable_test: mark dummy variable for loop [[maybe_unused]]
idl-compiler: mark captured this used
raft: reference this explicitly
util/result_try: reference this explicitly
sstables/sstables: mark dummy variable for loop [[maybe_unused]]
treewide: do not define/capture unused variables
service: storage_service: clear _node_ops in batch
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
there is not need to have a dedicated function which is only consumed
by `main()`. so let's move the body of `get_tools()` into `main`. and
with this change, a plain C array would suffice. so just use a plain
array for tools.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so we can encapsulate the description of a certain tool in this
struct with a more readable field name in comparison with a tuple<>,
if we want to track all tools in this vector.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
so, in addition to looking up a tool by the name in it, we will be
able to list all tools in this vector. this change paves the road to
a more general solution to handle `--list-tools`.
in this change
* `lookup_main_func()` is replaced by `get_tools()`.
* instead of checking `main_func` out of the if block,
check it in the `if` block. as we already know if we have a matched
tool in the `if` block, and we can early return right there.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Task ttl can be set with task manager test api, which is disabled
in release mode.
Move get_and_update_ttl from task manager test api to task manager
api, so that it can be used in release mode.
Closes#12894
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* use `defer_verbose_shutdown()` to shutdown compaction manager
`EDQUOT` is quite similar as `ENOSPC`, in the sense that both of them
are caused by environmental issues.
before this change, `compaction_manager` filters the
ENOSPC exceptions thrown by `compaction_manager::really_do_stop()`,
so they are not propagated to caller when calling
`compaction_manager::stop()` -- only a warning message is printed
in the log. but `EDQUOT` is not handled.
after this change, the exception raised by compaction manager's
stop process is not filtered anymore and is handled by
`defer_verbose_shutdown()` instead, which is able to check the
type of exception, and print out error message in the log. so
the `ENOSPC` and `EDQUOT` errors are taken care of, and more
visible from user's perspective as they are printed as errors
instead of warning. but they are not printed using the
`compaction_manager` logger anymore. so if our testing or user's
workflow depends on this behavior, the related setting should be
updated accordingly.
Fixes#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
EDQUOT can be returned as the errno when the underlying filesystem
is trying to reserve necessary resources from disk for performing
i/o on behalf of the effective user, and the filesystem fails to
acquire the necessary resources. it could be inode, volume space,
or whatever resources for completing the i/o operation. but none
of them is the consequence of scylla's fault. so we should not
`abort()` at seeing this errno. instead, it's should be reported
to the administrator.
in this change, EDQUOT is also considered as an environmental
failure just like EIO, EACCES and ENOSPC. they could be thrown
when stopping an server.
Fixes#12626
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The system_keyspace defines several auxiliary methods to help view_builder update system.scylla_views_builds_in_progress and system.built_views tables. All use global qctx thing.
It only takes adding view_builder -> system_keyspace dependency in order to de-static all those helpers and let them use query-processor from it, not the qctx.
Closes#12728
* github.com:scylladb/scylladb:
system_keysace: De-static calls that update view-building tables
storage_service: Coroutinize mark_existing_views_as_built()
api: Unset column_famliy endpoints
api: Carry sharded<db::system_keyspace> reference over
view_builder: Add system_keyspace dependency
The API calls in question will use system keyspace, that starts before
(and thus stops after) and nowadays indirectly uses database instance
that also starts earlier (and also stops later), so this avoids
potential dangling references.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the column_family/get_built_indexes call that calls a system
keyspace method to fetch data from scylla_views_builds_in_progress
table, so the system keyspace reference will be needed in the API
handler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The view builder updates system.scylla_views_builds_in_progress and
.built_views tables and thus needs the system keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As an initial part of integration of compaction with task manager, compaction
module is added. Compaction module inherits from tasks::task_manager::module
and shared_ptr to it is kept in compaction manager. No compaction tasks are
created yet.
before this change, we returns the total memory managed by Seastar
in the "total" field in system.memory. but this value only reflect
the total memory managed by Seastar's allocator. if
`reserve_additional_memory` is set when starting app_template,
Seastar's memory subsystem just reserves a chunk of memory of this
specified size for system, and takes the remaining memory. since
f05d612da8, we set this value to 50MB for wasmtime runtime. hence
the test of `TestRuntimeInfoTable.test_default_content` in dtest
fails. the test expects the size passed via the option of
`--memory` to be identical to the value reported by system.memory's
"total" field.
after this change, the "total" field takes the reserved memory
for wasm udf into account. the "total" field should reflect the total
size of memory used by Scylla, no matter how we use a certain portion
of the allocated memory.
Fixes#12522
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#12573
* configure.py:
- include `test/perf/perf_sstable` and its dependencies in scylla_perfs
* test/perf/perf_sstable.cc: change `main()` to
`perf::scylla_sstable_main()`
* test/perf/entry_point.hh: add
`perf::scylla_sstable_main()`
* main.cc:
- dispatch "perf-sstable" subcommand to
`perf::scylla_sstable_main`
before this change, we have a tool at `test/perf/perf_sstable`
for running performance tests by exercising sstable related operations.
after this change, the `test/perf/perf_sstable` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-sstable`
[options, ...]` to perform the same tests previous driven by the tool.
Fixes#12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include `test/perf/perf_row_cache_update.cc` in scylla_perfs
* main.cc:
- dispatch "perf-row-cache-update" subcommand to
`perf::scylla_row_cache_update_main`
* test/perf/perf_fast_forward.cc: change `main()` to
`perf::scylla_row_cache_update_main()`
* test/perf/entry_point.hh: add
`perf::scylla_row_cache_update_main()`
before this change, we have a tool at `test/perf/perf_row_cache_update`
for running performance tests by updating row cache.
after this change, the `test/perf/perf_row_cache_update` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-row-cache-update
[options, ...]` to perform the same tests previous driven by the tool.
Fixes#12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include `test/perf/perf_simple_query.cc` in scylla_perfs
* main.cc:
- dispatch "perf-fast-forward" subcommand to
`perf::scylla_fast_forward_main`
* test/perf/perf_fast_forward.cc: change `main()` to
`perf::scylla_simple_query_main()`
* test/perf/entry_point.hh: add
`perf::scylla_simple_query_main()`
before this change, we have a tool at `test/perf/perf_fast_forward`
for running performance tests by fast forwarding the reader.
after this change, the `test/perf/perf_fast_forward` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-fast-forward
[options, ...]` to perform the same tests previous driven by the tool.
Fixes#12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* configure.py:
- include scylla_perfs in scylla
- move 'test/lib/debug.cc' down scylla_perfs, as the latter uses
`debug::the_database`
- link `scylla` against seastar_testing_libs also. because we
use the helpers in `test/lib/random_utils.hh` for generating
random numbers / sequences in `perf_simple_query.cc`, and
`random_utils.hh` references `seastar::testing::local_random_engine`
as a local RNG. but `seastar::testing::local_random_engine`
is included in `libseastar_testing.a` or
`libseastar_perf_testing.a`. since we already have the rules for
linking against `libseastar_testing.a`, let's just reuse them,
and link `scylla` against this new dependency.
* main.cc:
- dispatch "perf-simple-query" subcommand to
`perf::scylla_simple_query_main`
* test/perf/perf_simple_query.cc: change `main()` to
`perf::scylla_simple_query_main()`
* test/perf/entry_point.hh: define the main function entries
so `main.cc` can find them. it's quite like how we collect
the entries in `tools/entry_point.hh`
before this change, we have a tool at `test/perf/perf_simple_query`
for running performance test by sending simple query to a single-node
cluster.
after this change, the `test/perf/perf_simple_query` is integreated
into `scylla` as a subcommand. so we can run `scylla perf-simple-query
[options, ...]` to perform the same tests previous driven by the tool.
Fixes#12484
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
we want to integrate some perf test into scylla executable, so we
can run them on a regular basis. but `test/lib/cql_test_env.cc`
shares `debug::the_database` with `main.cc`, so we cannot just
compile them into a single binary without changing them.
before this change, both `test/lib/cql_test_env.cc`
and `main.cc` define `debug::the_database`.
after this change, `debug::the_database` is extracted into
`debug.cc`, so it compiles into a separate compiling unit.
and scylla and tests using seastar testing framework are linked
against `debug.cc` via `scylla_core` respectively. this paves the road to
integrating scylla with the tests linking aginst
`test/lib/cql_test_env.cc`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
instead of introducing yet another variable for tracking the
status, update the args right away. for better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
refactor main() to extract lookup_main_func() out, so we find
the main_func in a table instead of using a lengthy if-then-else
clause.
when the length of the list of candidates of dispatch grows, the
code would be less structured. so in this change, the code looking
up for the main_func is extracted into a dedicated function for
better readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.
Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.
Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory do the WASM programs use and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually acheved by eviction of all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their cache of wasm,
the maximum size of indivitual instances stored in the cache, the
time after which the instances are evicted, the fuel that all wasm
UDFs are allowed to consume before yielding (for the control of
latency), the fuel that wasm UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop) and the hard limit of the size of UDFs
that are executed (to avoid large allocations)
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
raft: replace experimental raft option with dedicated flag
main: move supervisor notification about group registry start where it actually starts
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added
to identify the problem.
Fixes: #12252
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
The Host ID now uniquely identifies a node (we no longer steal it during
node replace) and Raft is still experimental. We can reuse the Host ID
of a node as its Raft ID. This will allow us to remove and simplify a
lot of code.
With this we can already remove some dead code in this commit.
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelizm.
This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.
Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.
We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.
This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
1) make address map API flexible
Before this patch:
- having a mapping without an actual IP address was an
internal error
- not having a mapping for an IP address was an internal
error
- re-mapping to a new IP address wasn't allowed
After this patch:
- the address map may contain a mapping
without an actual IP address, and the caller must be prepared for it:
find() will return a nullopt. This happens when we first add an entry
to Raft configuration and only later learn its IP address, e.g. via
gossip.
- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications
Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.
3) prompt address map state with app state
Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).
The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.
Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.
Thanks to the changes above, Raft configuration no longer
stores IP addresses.
We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.
This patch removes this requirement and in essence GAs this feature.
Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.
This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.
Fixes#12037
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12049
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Closes#11791
* github.com:scylladb/scylladb:
test/raft: raft_address_map_test: add replication test
service/raft: raft_address_map: replicate non-expiring entries to other shards
service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
service/raft: turn raft_address_map into a service