Commit Graph

3075 Commits

Author SHA1 Message Date
Pavel Emelyanov
64c9359443 storage_proxy: Don't use default-initialized endpoint in get_read_executor()
After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized, which is the :0 ipv6 address. Later below,
this address is used as-is to check whether it belongs to the same DC,
which is not nice, as :0 is not the address of any existing endpoint.

The recent move of dc/rack data onto the topology made this place reveal
itself by emitting an internal error, since :0 is not present in the
topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to the "default_dc" datacenter, which may or
may not match the dc of the local node.

The fix is to explicitly distinguish a set extra_replica from an unset one.

fixes: #11825

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11833
2022-10-25 09:16:50 +03:00
Botond Dénes
396d9e6a46 Merge 'Subscribe repair_info::abort on node_ops_meta_data::abort_source' from Pavel Emelyanov
The storage_service::stop() calls repair_service::abort_repair_node_ops(), but at that time the sharded<repair_service> is already stopped and calling .local() on it just crashes.

The suggested fix is to remove the explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed to the node_ops_meta_data's abort source and abort themselves automatically.

fixes: #10284

Closes #11797

* github.com:scylladb/scylladb:
  repair: Remove ops_uuid
  repair: Remove abort_repair_node_ops() altogether
  repair: Subscribe on node_ops_info::as abortion
  repair: Keep abort source on node_ops_info
  repair: Pass node_ops_info arg to do_sync_data_using_repair()
  repair: Mark repair_info::abort() noexcept
  node_ops: Remove _aborted bit
  node_ops: Simplify construction of node_ops_metadata
  main: Fix message about repair service starting
2022-10-21 10:08:43 +03:00
Pavel Emelyanov
898579027d gossiper: Pass current snitch name into checker
The gossiper makes sure the local snitch name is the same as that of the
other nodes in the ring. It currently reaches for the global snitch to
get the name; this patch passes the name as an argument instead, because
the caller (storage_service) holds a local reference to the snitch instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:38 +03:00
Pavel Emelyanov
ea8bfc4844 storage_service: Keep local snitch reference
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection

At this point it's worth adding an explicit storage_service->snitch
dependency and patching the above to use a local reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:30:00 +03:00
Nadav Har'El
264f453b9d Merge 'Associate alternator user with its service level configuration' from Piotr Sarna
Until now, authentication in alternator served only two purposes:
 - refusing clients without proper credentials
 - printing user information with logs

After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.

tests: manually, by adding more debug logs and inspecting that the per-service-level timeout value was properly applied for an authenticated alternator user

Fixes #11379

Closes #11380

* github.com:scylladb/scylladb:
  alternator: propagate authenticated user in client state
  client_state: add internal constructor with auth_service
  alternator: pass auth_service and sl_controller to server
2022-10-19 23:27:48 +03:00
Botond Dénes
2d581e9e8f Merge "Maintain dc/rack by topology" from Pavel Emelyanov
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from the snitch onto the topology object, and this set finalizes it. After
it, the snitch service stops depending on the gossiper and the system
keyspace and is ready for de-globalization. As a nice side effect, the
system keyspace no longer needs to maintain the dc/rack info cache and
its startup code can be relaxed.

refs: #2737
refs: #2795
"

* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
  system_keyspace: Dont maintain dc/rack cache
  system_keyspace: Indentation fix after previous patch
  system_keyspace: Coroutinize build_dc_rack_info()
  topology: Move all post-configuration to topology::config
  snitch: Start early
  gossiper: Do not export system keyspace
  snitch: Remove gossiper reference
  snitch: Mark get_datacenter/_rack methods const
  snitch: Drop some dead dependency knots
  snitch, code: Make get_datacenter() report local dc only
  snitch, code: Make get_rack() report local rack only
  storage_service: Populate pending endpoint in on_alive()
  code: Populate pending locations
  topology: Put local dc/rack on topology early
  topology: Add pending locations collection
  topology: Make get_location() errors more verbose
  token_metadata: Add config, spread everywhere
  token_metadata: Hide token_metadata_impl copy constructor
  gossiper: Remove messaging service getter
  snitch: Get local address to gossip via config
  ...
2022-10-19 06:50:21 +03:00
Pavel Emelyanov
8231b4ec1b repair: Subscribe on node_ops_info::as abortion
When node_ops_meta_data aborts, it also kicks repair to find and abort
all relevant repair_infos. This can now be simplified by subscribing the
repair_meta to the abort source and aborting it without an explicit kick.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
bf5825daac repair: Keep abort source on node_ops_info
The next patches will need to subscribe to node_ops_meta_data's abort
source inside the repair code, so keep a pointer to it on node_ops_info
too. At the same time, node_ops_info::abort becomes obsolete, because the
same check can be performed via abort_source->abort_requested()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:23 +03:00
Pavel Emelyanov
34458ec2c5 node_ops: Remove _aborted bit
A short cleanup "while at it" -- the node_ops_meta_data doesn't need to
carry a dedicated _aborted boolean, since the abort source that sets it
is directly available

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:04:22 +03:00
Pavel Emelyanov
96f0695731 node_ops: Simplify construction of node_ops_metadata
It always constructs node_ops_info the same way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 20:03:53 +03:00
Tomasz Grabiec
87b7e7ff9c Merge 'storage_proxy: prepare for fencing, complex ops' from Avi Kivity
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
 - partition scan
 - counter mutation
 - paxos

With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.

Closes #11296

* github.com:scylladb/scylladb:
  storage_proxy: hold effective_replication_map for the duration of a paxos transaction
  storage_proxy: move paxos_response_handler class to .cc file
  storage_proxy: deinline paxos_response_handler constructor/destructor
  storage_proxy: use consistent effective_replication_map for counter coordinator
  storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
  storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
  storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
  storage_proxy: query_singular: use fewer smart pointers
  storage_proxy: query_singular: simplify lambda captures
  locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
  storage_proxy: use consistent token_metadata with rest of singular read
2022-10-14 15:44:35 +02:00
Avi Kivity
1feaa2dfb4 storage_proxy: handle_write: use coroutine::all() instead of when_all()
coroutine::all() saves an allocation. Since it's safe for lambda
coroutines, remove a coroutine::lambda wrapper.

Closes #11749
2022-10-14 06:56:16 +03:00
Tomasz Grabiec
ee2398960c Merge 'service/raft: simplify raft_address_map' from Kamil Braun
The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.

In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system.

The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.

Closes #11763

* github.com:scylladb/scylladb:
  service/raft: raft_address_map: get rid of `is_linked` checks
  service/raft: raft_address_map: get rid of `to_list_iterator`
  service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
  service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
  service/raft: raft_address_map: don't use intrusive set for timestamped entries
  service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
2022-10-13 18:08:49 +02:00
Kamil Braun
5a9371bcb0 service/raft: raft_address_map: get rid of is_linked checks
Being linked is an invariant of `expiring_entry_ptr`. Make it explicit
by moving the `_expiring_list.push_front` call into the constructor.
2022-10-13 15:17:07 +02:00
Kamil Braun
cdf3367c05 service/raft: raft_address_map: get rid of to_list_iterator
Unnecessary.
2022-10-13 15:17:06 +02:00
Kamil Braun
0e29495c38 service/raft: raft_address_map: simplify ownership of expiring_entry_ptr
The owner of `expiring_entry_ptr` was almost always its corresponding
`timestamped_entry`; it would delete the expiring entry when it itself got
destroyed. There was one explicit call to `unlink_and_dispose`, which
made the picture unclear.

Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.

We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.

The code becomes shorter and simpler.
2022-10-13 15:16:40 +02:00
Petr Gusev
c76cf5956d removenode: don't stream data from the leaving node
If removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a "Stream failed" error
while trying to stream data from that node.

In this patch we explicitly reject the removenode operation
if the gossiper still considers the leaving node up.

Closes #11704
2022-10-13 15:11:32 +02:00
Asias He
6134fe4d1f storage_service: Prevent removed node from rejoining in handle_state_normal
- Start n1, n2, n3 (127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4

The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.

This patch prevents this by checking the status for the host id across
all entries. If any entry shows the node with that host id in LEFT
status, refuse to put the node in NORMAL status.

Fixes #11355

Closes #11361
2022-10-13 15:11:32 +02:00
Avi Kivity
a2da08f9f9 storage_proxy: hold effective_replication_map for the duration of a paxos transaction
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is hold the effective_replication_map for the
duration of the transaction and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
2022-10-13 14:27:26 +03:00
Avi Kivity
69aaa5e131 storage_proxy: move paxos_response_handler class to .cc file
It's not used elsewhere.
2022-10-13 14:27:26 +03:00
Avi Kivity
b2f3934e95 storage_proxy: deinline paxos_response_handler constructor/destructor
They have no business being inline as it's a heavyweight object.
2022-10-13 14:27:26 +03:00
Avi Kivity
94e4ff11be storage_proxy: use consistent effective_replication_map for counter coordinator
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.

The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
2022-10-13 14:27:23 +03:00
Avi Kivity
406a046974 storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.

Fix by capturing it once in the caller and propagating it everywhere.
2022-10-13 13:56:52 +03:00
Avi Kivity
5d320e95d5 storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
Capture token_metadata by reference rather than by smart pointer, since
our effective_replication_map_ptr protects it.
2022-10-13 13:56:52 +03:00
Avi Kivity
f75efa965f storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
2022-10-13 13:56:52 +03:00
Avi Kivity
161ce4b34f storage_proxy: query_singular: use fewer smart pointers
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
2022-10-13 13:56:33 +03:00
Avi Kivity
efd89c1890 storage_proxy: query_singular: simplify lambda captures
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambda, since there's one less thing to
worry about.
2022-10-13 13:52:54 +03:00
Avi Kivity
86a48cf12f storage_proxy: use consistent token_metadata with rest of singular read
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
2022-10-13 13:46:04 +03:00
Kamil Braun
92dd1f7307 service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
`timestamped_entry` had two fields:

```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```

The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately afterwards a corresponding `expiring_entry_ptr` was
constructed, which reset the `_last_accessed` field again and set `_lru_entry`.

The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
2022-10-12 12:22:57 +02:00
Kamil Braun
262b9473d5 service/raft: raft_address_map: don't use intrusive set for timestamped entries
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change), but we want the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)

However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.

The code becomes shorter with this change.

I also use a map instead of a set because we need to modify the
`timestamped_entry`, which wouldn't be possible if it were used as an
`unordered_set` key. In fact, using a map here makes more sense: we were
using the intrusive set similarly to a map anyway, because all lookups
were performed using the `_id` field of `timestamped_entry` (now that the
field has moved out of the struct, it's used as the map's key).
2022-10-12 12:22:50 +02:00
Kamil Braun
0c13c85752 service/raft: raft_address_map: store reference to timestamped_entry in expiring_entry_ptr
The class was storing a pointer which couldn't be null.
A reference is a better fit in this case.
2022-10-11 17:21:01 +02:00
Asias He
810b424a8c storage_service: Refuse to bootstrap a new node when a node has unknown gossip status
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 as shutdown
- Add n4

Node n4 will learn the IP and uuid of n3, but it does not know the gossip
status of n3, since gossip status is published only by the node itself.
After a full cluster shutdown, the gossip status of n3 will not be
present until n3 is restarted. So n4 will not consider n3 part of the
ring.

In this case, it is better to reject the bootstrap.

With this patch, one would see the following when adding n4:

```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```

The user needs to perform either of the following before adding a new node:

1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster

Fixes #6088

Closes #11425
2022-10-11 15:47:34 +03:00
Botond Dénes
378c6aeebd Merge 'More Raft upgrade tests' from Kamil Braun
Refactor the existing upgrade tests, extracting some common functionality
into helper functions.

Add more tests. They check the upgrade procedure and recovery from
failure in scenarios such as a node failing and causing the procedure to
get stuck, or losing a majority in a fully upgraded cluster.

Add some new functionality to `ScyllaRESTAPIClient`, like injecting errors
and obtaining gossip generation numbers.

Extend the removenode function to allow ignoring dead nodes.

Improve checking for CQL availability when starting nodes to speed up testing.

Closes #11725

* github.com:scylladb/scylladb:
  test/topology_raft_disabled: more Raft upgrade tests
  test/topology_raft_disabled: refactor `test_raft_upgrade`
  test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
  test/pylib: rest_client: propagate errors from put_json
  test/pylib: fix some type hints
  test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
2022-10-11 15:30:00 +03:00
Kamil Braun
08e654abf5 Merge 'raft: (service) cleanups on the path for dynamic IP address support' from Konstantin Osipov
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
   This will provide a single extension point when it's necessary to
   populate the raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
   discovery won't store raft::server_address. By the same token, stop
   using discovery::peer_set anywhere outside discovery (for persistence);
   use a peer_list instead, which is easier to marshal.

Closes #11676

* github.com:scylladb/scylladb:
  raft: (discovery) do not use raft::server_address to carry IP data
  raft: (group0) API refactoring to avoid raft::server_address
  raft: rename group0_upgrade.hh to group0_fwd.hh
  raft: (group0) move the code around
  raft: (discovery) persist a list of discovered peers, not a set
  raft: (group0) always start group0 using start_server_for_group0()
2022-10-11 13:43:41 +02:00
Asias He
58c65954b8 storage_service: Reject decommission if nodes are down
- Start n1, n2, n3

- Apply network nemesis as below:
  + Block gossip traffic going from nodes 1 and 2 to node 3.
  + All the other rpc traffic flows normally, including gossip traffic
    from node 3 to nodes 1 and 2 and responses to node_ops commands from
    nodes 1 and 2 to node 3.

- Decommission n3

Currently, the decommission will succeed because all the network
traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n2 could not move n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL forever.

The reason the node stays DL forever is that with node_ops_cmd based
node operations, we still rely on the gossip status STATUS_LEFT from the
node being decommissioned to notify the other nodes that this node has
finished decommissioning and can be moved from STATUS_LEAVING to
STATUS_LEFT.

This patch fixes this by checking gossip liveness before running the
decommission, and rejecting it if required peer nodes are down.

With the fix, the decommission of n3 will fail like this:

$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})

Fixes #11302

Closes #11362
2022-10-11 14:09:28 +03:00
Pavel Emelyanov
8b8b37cdda system_keyspace: Dont maintain dc/rack cache
Some good news finally: the saved dc/rack info about the ring is now
only loaded once at start, so the whole cache is not needed and the
loading code in storage_service can be greatly simplified

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
4206b1f98f snitch, code: Make get_datacenter() report local dc only
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
6c6711404f snitch, code: Make get_rack() report local rack only
All the code out there now calls snitch::get_rack() to get the rack of
the local node only; for other nodes, topology::get_rack(endpoint) is
used. Now that the topology is properly populated with endpoints, it can
finally be patched to stop using the snitch and to get the rack from its
internal collections

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
bc813771e8 storage_service: Populate pending endpoint in on_alive()
A special-purpose add-on to the previous patch.

When the messaging service accepts a new connection, it sometimes may
want to drop it early based on whether the client is from the same
dc/rack or not. However, at this stage the information might not yet
have had a chance to spread via the storage service's pending-tokens
update paths, so here's one more place -- the on_alive() callback

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
1be97a0a76 code: Populate pending locations
Previous patches added the concept of pending endpoints in the topology,
this patch populates endpoints in this state.

Also, set_pending_ranges() is patched to make sure that the tokens
added for the endpoint(s) are added for something that's known by the
topology. The same check exists in update_normal_tokens()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
77bde21024 storage_service: Shuffle on_alive() callback
No functional changes, just keeping some conditions from if()s as local
variables. This is the churn-reducing preparation for one of the next
patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Konstantin Osipov
3e46c32d7b raft: (discovery) do not use raft::server_address to carry IP data
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.

Since we plan to remove IP addresses from the Raft configuration,
using raft::server_address in discovery while still storing IPs in it
would create ambiguity: in some uses raft::server_address would store an
IP, and in others it would not.

So switch to a dedicated data structure for the purposes of discovery,
discovery_peer, which contains an (IP, raft server id) pair.

Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
2022-10-10 16:24:33 +03:00
Konstantin Osipov
8857e017c7 raft: (group0) API refactoring to avoid raft::server_address
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.

These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.

This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
224dd9ce1e raft: rename group0_upgrade.hh to group0_fwd.hh
The plan is to add other group-0-related forward declarations
to this file, not just the ones for upgrade.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
e226624daf raft: (group0) move the code around
Move the load/store functions for discovered peers up, since going
forward they'll be used in start_server_for_group0() to extend the
address map prior to start (and thus speed up bootstrap).
2022-10-10 15:58:48 +03:00
Konstantin Osipov
199b6d6705 raft: (discovery) persist a list of discovered peers, not a set
We plan to reuse the discovery table to store the peers
after discovery is over, so the load/store API must be generalized
for use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
2022-10-10 15:58:48 +03:00
Konstantin Osipov
746322b740 raft: (group0) always start group0 using start_server_for_group0()
When IP addresses are removed from raft::configuration, it's key to
initialize the raft_address_map with IP addresses before we start group
0. The best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
2022-10-10 15:58:48 +03:00
Kamil Braun
4974a31510 test/topology_raft_disabled: more Raft upgrade tests
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.

Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
2022-10-10 14:32:10 +02:00
Pavel Emelyanov
caed12c8f2 system_keyspace: Add .shutdown() method
Many services out there have one (sometimes called .drain()) that's
called early on stop and is responsible for preparing the service for
stop -- aborting pending/in-flight fibers and the like.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-10 15:29:33 +03:00
Petr Gusev
0923cb435f raft: mark removed servers as expiring instead of dropping them
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous: it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from the address_map.

In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.

Fixes: #11228

Closes #11748
2022-10-07 19:08:34 +02:00