Commit Graph

3124 Commits

Author SHA1 Message Date
Benny Halevy
c61083852c storage_service: handle_state_normal: calculate candidates_for_removal when replacing tokens
We currently try to detect a replaced node so to insert it to
endpoints_to_remove when it has no owned tokens left.
However, for each token we first generate a multimap using
get_endpoint_to_token_map_for_reading().

There are 2 problems with that:

1. unless the replaced node owns a single token, this map will not
   be empty after erasing one token out of it, since the
   token metadata has not changed yet (this is done later with
   update_normal_tokens(owned_tokens, endpoint)).
2. generating this map for each token is inefficient, turning this
   algorithm complexity to quadratic in the number of tokens...

This change copies the current token_to_endpoint map
to temporary map and erases replaced tokens from it,
while maintaining a set of candidates_for_removal.

After traversing all replaced tokens, we check again
the `token_to_endpoint_map` erasing from `candidates_for_removal`
any endpoint that still owns tokens.
The leftover candidates are endpoints the own no tokens
and so they are added to `hosts_to_remove`.

Fixes #12082

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12141
2022-12-05 16:17:18 +01:00
Kamil Braun
cbdcc944b5 service/raft: specialized verb for failure detector pinger
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.

There are multiple reasons to do so.

One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.

Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.

Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.

Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.

To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).

Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
2022-12-01 20:54:18 +01:00
Kamil Braun
02c64becdc db: system_keyspace: de-staticize {get,set}_raft_server_id
Part of the anti-globals war.
2022-12-01 20:54:18 +01:00
Kamil Braun
99fe580068 service/raft: make this node's Raft ID available early in group registry
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.

Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.

We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.

This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
2022-12-01 20:54:18 +01:00
Kamil Braun
0f9d0dd86e Merge 'raft: support IP address change' from Konstantin Osipov
This is the core of dynamic IP address support in Raft, moving out the
IP address sourcing from Raft Group 0 configuration to gossip. At start
of Raft, the raft id <> IP address translation map is tuned into the
gossiper notifications and learns IP addresses of Raft hosts from them.

The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.

Closes #12035

* github.com:scylladb/scylladb:
  raft: (rpc) do not throw in case of a missing IP address in RPC
  raft: (address map) actively maintain ip <-> raft server id map
2022-11-30 15:40:18 +01:00
Michał Jadwiszczak
8e64e18b80 forward_service: add debug logs
Adds a few debug logs to see what is happening in https://github.com/scylladb/scylladb/issues/11684

Wrapped `forward_result::printer` into `seastar::value_of` to lazy
evaluate the printer

Closes #12113
2022-11-30 12:15:26 +02:00
Botond Dénes
50aea9884b Merge 'Improve the Raft upgrade procedure' from Kamil Braun
Better logging, less code, a minor fix.

Closes #12135

* github.com:scylladb/scylladb:
  service/raft: raft_group0: less repetitive logging calls
  service/raft: raft_group0: fix sleep_with_exponential_backoff
2022-11-30 11:24:20 +02:00
Avi Kivity
6a5d9ff261 treewide: use non-experimental std::source_location
Now that we use libstdc++ 12, we can use the standardized
source_location.

Closes #12137
2022-11-30 11:06:43 +02:00
Konstantin Osipov
fbe7886cc0 raft: (rpc) do not throw in case of a missing IP address in RPC
Remove raft_address_map::get_inet_address()

While at it, coroutinize some rpc mehtods.

To propagate up the event of missing IP address, use coroutine::exception(
with a proper type (raft::transport_error) and a proper error message.

This is a building block from removing
raft_address_map::get_inet_address() which is too generic, and shifting
the responsibility of handling missing addresses to the address map
clients. E.g. one-way RPC shouldn't throw if an address is missing, but
just drop the message.

PS An attempt to use a single template function rendered to be too
complex:
- some functions require a gate, some don't
- some return void, some future<> and some future<raft::data_type>
2022-11-29 19:55:48 +03:00
Konstantin Osipov
73e5298273 raft: (address map) actively maintain ip <-> raft server id map
1) make address map API flexible

Before this patch:
- having a mapping without an actual IP address was an
  internal error
- not having a mapping for an IP address was an internal
  error
- re-mapping to a new IP address wasn't allowed

After this patch:

- the address map may contain a mapping
  without an actual IP address, and the caller must be prepared for it:
  find() will return a nullopt. This happens when we first add an entry
  to Raft configuration and only later learn its IP address, e.g.  via
  gossip.

- it is allowed to re-map an existing entry to a new address;
2) subscribe to gossip notifications

Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.

3) prompt address map state with app state

Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).

The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.

Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.

Thanks to the changes above, Raft configuration no longer
stores IP addresses.

We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
2022-11-29 19:55:43 +03:00
Kamil Braun
3dbcff435f service/raft: raft_group0: less repetitive logging calls
Some log messages in retry loops in the Raft upgrade procedure included
a sentence like "sleeping before retrying..."; but not all of them.

With the recently added `sleep_with_exponential_backoff` abstraction we
can put this "sleeping..." message in a single place, and it's also easy
to say how long we're going to sleep.

I also enjoy using this `source_location` thing.
2022-11-29 17:42:43 +01:00
Kamil Braun
580bdec875 service/raft: raft_group0: fix sleep_with_exponential_backoff
It was immediately jumping to _max_retry_period.
2022-11-29 16:27:59 +01:00
Avi Kivity
0da66371a5 storage_proxy: coroutinize inner continuation of create_hint_sync_point()
It is part of a coroutine::parallel_for_each(), which is safe for lambda coroutines.

Closes #12057
2022-11-28 11:30:00 +02:00
Avi Kivity
70bfa708f5 storage_proxy: coroutinize change_hints_host_filter()
Trivial straight-line code, no performance implications.

Closes #12056
2022-11-23 15:34:24 +02:00
Benny Halevy
731a74c71f storage_proxy: pass topology& to sort_endpoints_by_proximity
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:02:40 +02:00
Benny Halevy
ab3fc1e069 storage_proxy: pass topology& to is_worth_merging_for_range_query
It mustn't use the latest topology that may differ from the
one used by the query as it may be missing nodes
(e.g. after concurrent decommission).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-22 15:01:58 +02:00
Avi Kivity
994603171b Merge 'Add validator to the mutation compactor' from Botond Dénes
Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written.
This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too.
This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign.

Fixes: https://github.com/scylladb/scylladb/issues/11174

Closes #11532

* github.com:scylladb/scylladb:
  mutation_compactor: add validator
  mutation_fragment_stream_validator: add a 'none' validation level
  test/boost/mutation_query_test: test_partition_limit: sort input data
  querier: consume_page(): use partition_start as the sentinel value
  treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
  treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
  position_in_partition: add for_partition_{start,end}()
2022-11-20 20:33:26 +02:00
Kamil Braun
d7649a86c4 Merge 'Build up to support of dynamic IP address changes in Raft' from Konstantin Osipov
We plan to stop storing IP addresses in Raft configuration, and instead
use the information disseminated through gossip to locate Raft peers.

Implement patches that are building up to that:
* improve Raft API of configuration change notifications
* disseminate raft host id in Gossip
* avoid using Raft addresses from Raft configuraiton, and instead
  consistently use the translation layer between raft server id <-> IP
  address

Closes #11953

* github.com:scylladb/scylladb:
  raft: persist the initial raft address map
  raft: (upgrade) do not use IP addresses from Raft config
  raft: (and gossip) begin gossiping raft server ids
  raft: change the API of conf change notifications
2022-11-18 11:38:19 +01:00
Asias He
4571fcf9e7 token_metadata: Rename is_member to is_normal_token_owner
The name is_normal_token_owner is more clear than is_member.
The is_normal_token_owner reflects what it really checks.
2022-11-18 09:29:20 +08:00
Pavel Emelyanov
a396c27efc Merge 'message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client' from Kamil Braun
`get_rpc_client` calculates a `topology_ignored` field when creating a
client which says whether the client's endpoint had topology information
when this client was created. This is later used to check if that client
needs to be dropped and replaced with a new client which uses the
correct topology information.

The `topology_ignored` field was incorrectly calculated as `true` for
pending endpoints even though we had topology information for them. This
would lead to unnecessary drops of RPC clients later. Fix this.

Remove the default parameter for `with_pending` from
`topology::has_endpoint` to avoid similar bugs in the future.

Apparently this fixes #11780. The verbs used by decommission operation
use RPC client index 1 (see `do_get_rpc_client_idx` in
message/messaging_service.cc). From local testing with additional
logging I found that by the time this client is created (i.e. the first
verb in this group is used), we already know the topology. The node is
pending at that point - hence the bug would cause us to assume we don't
know the topology, leading us to dropping the RPC client later, possibly
in the middle of a decommission operation.

Fixes: #11780

Closes #11942

* github.com:scylladb/scylladb:
  message: messaging_service: check for known topology before calling is_same_dc/rack
  test: reenable test_topology::test_decommission_node_add_column
  test/pylib: util: configurable period in wait_for
  message: messaging_service: fix topology_ignored for pending endpoints in get_rpc_client
  message: messaging_service: topology independent connection settings for GOSSIP verbs
2022-11-17 20:14:32 +03:00
Konstantin Osipov
262566216b raft: persist the initial raft address map 2022-11-17 14:26:36 +03:00
Konstantin Osipov
b35af73fdf raft: (upgrade) do not use IP addresses from Raft config
Always use raft address map to obtain the IP addresses
of upgrade peers. Right now the map is populated
from Raft configuration, so it's an equivalent transformation,
but in the future raft address map will be populated from other sources:
discovery and gossip, hence the logic of upgrade will change as well.

Do not proceed with the upgrade if an address is
missing from the map, since it means we failed to contact a raft member.
2022-11-17 14:26:31 +03:00
Konstantin Osipov
051dceeaff raft: (and gossip) begin gossiping raft server ids
We plan to use gossip data to educate Raft RPC about IP addresses
of raft peers. Add raft server ids to application state, so
that when we get a notification about a gossip peer we can
identify which raft server id this notification is for,
specifically, we can find what IP address stands for this server
id, and, whenever the IP address changes, we can update Raft
address map with the new address.

On the same token, at boot time, we now have to start Gossip
before Raft, since Raft won't be able to send any messages
without gossip data about IP addresses.
2022-11-17 12:07:31 +03:00
Konstantin Osipov
990c7a209f raft: change the API of conf change notifications
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.

For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
2022-11-17 12:07:31 +03:00
Kamil Braun
9b2449d3ea test: reenable test_topology::test_decommission_node_add_column
Also improve the test to increase the probability of reproducing #11780
by injecting sleeps in appropriate places.

Without the fix for #11780 from the earlier commit, the test reproduces
the issue in roughly half of all runs in dev build on my laptop.
2022-11-16 14:01:50 +01:00
Botond Dénes
cbf9be9715 Merge 'Avoid 0.0.0.0 (and :0) as preferred IP' from Pavel Emelyanov
Despite docs discourage from using INADDR_ANY as listen address, this is not disabled in code. Worse -- some snitch drivers may gossip it around as the INTERNAL_IP state. This set prevents this from happening and also adds a sanity check not to use this value if it somehow sneaks in.

Closes #11846

* github.com:scylladb/scylladb:
  messaging_service: Deny putting INADD_ANY as preferred ip
  messaging_service: Toss preferred ip cache management
  gossiping_property_file_snitch: Dont gossip INADDR_ANY preferred IP
  gossiping_property_file_snitch: Make _listen_address optional
2022-11-16 08:30:42 +02:00
Pavel Emelyanov
bd48fdaad5 Merge 'handle_state_normal: do not update topology of removed endpoint' from Benny Halevy
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.

Then, later on, on_change sees that:
```
    if (get_token_metadata().is_member(endpoint)) {
        co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Closes #11930

* github.com:scylladb/scylladb:
  storage_service, system_keyspace: add debugging around system.peers update
  storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
2022-11-14 13:58:28 +03:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Benny Halevy
38d8777d42 storage_service, system_keyspace: add debugging around system.peers update
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:47 +02:00
Benny Halevy
5401b6055c storage_service: handle_state_normal: update topology and notify_joined endpoint only if not removed
Currently, when replacing a node ip, keeping the old host,
we might end up with the the old endpoint in system.peers
if it is inserted back into the topology by `handle_state_normal`
when on_join is called with the old endpoint.

Then, later on, on_change sees that:
```
        if (get_token_metadata().is_member(endpoint)) {
            co_await do_update_system_peers_table(endpoint, state, value);
```

As described in #11925.

Fixes #11925

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-09 14:45:22 +02:00
Gleb Natapov' via ScyllaDB development
2100a8f4ca service: raft: demote configuration change error to warning since it is retried anyway
Message-Id: <Y2ohbFtljmd5MNw0@scylladb.com>
2022-11-09 00:09:39 +01:00
Kamil Braun
e086521c1a direct_failure_detector: get rid of complex endpoint_id translations
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.

Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.

In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.

In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
2022-11-04 09:38:08 +01:00
Kamil Braun
bdeef77f20 service/raft: ping raft::server_ids, not gms::inet_addresses
Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).

Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.

To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).

This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.

To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
  an IP address, because now the stored `_alive_set` already contains
  `raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
  `gms::inet_address`. To send the ping message, we need to translate
  this to IP address; we do it by the `raft_address_map` pointer
  introduced in an earlier commit.

Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
2022-11-04 09:38:08 +01:00
Kamil Braun
ac70a05c7e service/raft: store raft_address_map reference in direct_fd_pinger
The pinger will use the map to translate `raft::server_id`s to
`gms::inet_address`es when pinging.
2022-11-04 09:38:08 +01:00
Kamil Braun
2c20f2ab9d gms: gossiper: move direct_fd_pinger out to a separate service
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
2022-11-04 09:38:08 +01:00
Pavel Emelyanov
efbfcdb97e Merge 'Replicate raft_address_map non-expiring entries to other shards' from Kamil Braun
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.

Closes #11791

* github.com:scylladb/scylladb:
  test/raft: raft_address_map_test: add replication test
  service/raft: raft_address_map: replicate non-expiring entries to other shards
  service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
  service/raft: turn raft_address_map into a service
2022-11-03 18:34:42 +03:00
Kamil Braun
7d84007fd5 service/raft: raft_address_map: replicate non-expiring entries to other shards
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
2022-10-31 09:17:12 +01:00
Kamil Braun
acacbad465 service/raft: raft_address_map: assert when entry is missing in drop_expired_entries 2022-10-31 09:17:12 +01:00
Kamil Braun
159bb32309 service/raft: turn raft_address_map into a service 2022-10-31 09:17:10 +01:00
Botond Dénes
2c021affd1 Merge 'storage_service, repair: use per-shard abort_source' from Benny Halevy
Prevent copying shared_ptr across shards
in do_sync_data_using_repair by allocating
a shared_ptr<abort_source> per shard in
node_ops_meta_data and respectively in node_ops_info.

Fixes #11826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11827

* github.com:scylladb/scylladb:
  repair: use sharded abort_source to abort repair_info
  repair: node_ops_info: add start and stop methods
  storage_service: node_ops_abort_thread: abort all node ops on shutdown
  storage_service: node_ops_abort_thread: co_return only after printing log message
  storage_service: node_ops_meta_data: add start and stop methods
  repair: node_ops_info: prevent accidental copy
2022-10-31 09:43:34 +02:00
Benny Halevy
9ef2631ec2 api, service: storage_service: removenode: allow passing ignore_nodes as uuid:s
Currently the api is inconsistent: requiring a uuid for the
host_id of the node to be removed, while the ignored nodes list
is given as comma-separated ip addresses.

Instead, support identifying the ignored_nodes either
by their host_id (uuid) or ip address.

Also, require all ignore_nodes to be of the same kind:
either UUIDs or ip addresses, as a mix of the 2 is likely
indicating a user error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:49:03 +03:00
Benny Halevy
40cd685371 storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
Allow specifying the dead node to ignore either as host_id
or ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Benny Halevy
340a5a0c94 api: storage_service: remove_node: validate host_id
The node to be removed must be identified by its host_id.
Validate that at the api layer and pass the parsed host_id
down to storage_service::removenode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Pavel Emelyanov
aa7a759ac9 messaging_service: Toss preferred ip cache management
Make it call cache_preferred_ip() even when the cache is loaded from
system_keyspace and move the connection reset there. This is mainly to
prepare for the next patch, but also makes the code a bit shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-27 14:25:43 +03:00
Benny Halevy
88f993e5ed repair: node_ops_info: add start and stop methods
Prepare for adding a sharded<abort_source> member.

Wire start/stop in storage_service::node_ops_meta_data.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:18:30 +03:00
Benny Halevy
c2f384093d storage_service: node_ops_abort_thread: abort all node ops on shutdown
A later patch adds a sharded<abort_source> to node_ops_info.
On shutdown, we must orderly stop it, so use node_ops_abort_thread
shutdown path (where node_ops_singal_abort is called will a nullopt)
to abort (and stop) all outstanding node_ops by passing
a null_uuid to node_ops_abort, and let it iterate over all
node ops to abort and stop them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:06 +03:00
Benny Halevy
0efd290378 storage_service: node_ops_abort_thread: co_return only after printing log message
Currently the function co_returns if (!uuid_opt)
so the log info message indicating it's stopped
is not printed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
47e4761b4e storage_service: node_ops_meta_data: add start and stop methods
Prepare for starting and stopping repair node_ops_info

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Benny Halevy
5c25066ea7 repair: node_ops_info: prevent accidental copy
Delete node_ops_info copy and move constructors before
we add a sharded<abort_source> member for the per-shard repairs
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-27 12:14:03 +03:00
Pavel Emelyanov
64c9359443 storage_proxy: Don't use default-initialized endpoint in get_read_executor()
After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized which is :0 ipv6 address. Later below this
address is used as-is to check if it belongs to the same DC or not which
is not nice, as :0 is not an address of any existing endpoint.

Recent move of dc/rack data onto topology made this place reveal itself
by emitting the internal error due to :0 not being present on the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to "default_dc" datacenter which may or may
not match with the dc of the local node.

The fix is to explicitly tell set extra_replica from unset one.

fixes: #11825

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11833
2022-10-25 09:16:50 +03:00