Commit Graph

4079 Commits

Author SHA1 Message Date
Gleb Natapov
ba7aa0d582 storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator 2024-01-14 15:45:53 +02:00
Gleb Natapov
1afc891bd5 storage_service: topology coordinator: add logging to removenode and decommission
Add some useful logging to removenode and decommission to be used by
tests later.
2024-01-14 15:45:53 +02:00
Gleb Natapov
97ab3f6622 storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
Introduce a new REST API, "/storage_service/cleanup_all",
that, when triggered, instructs the topology coordinator to initiate
a cluster-wide cleanup on all dirty nodes. This is done by introducing
a new global command, "global_topology_request::cleanup".
2024-01-14 15:45:53 +02:00
Gleb Natapov
0adb3904d8 storage_service: topology coordinator: manage cluster cleanup as part of the topology management
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects when the topology operation to
be executed cannot run safely until all dirty nodes complete cleanup,
and initiates the cleanup automatically. It also waits for
cleanup to complete before proceeding with the topology operation.

There can be a situation where a node that needs cleanup dies and never
clears the flag. In this case, if a topology operation that wants to
run next does not have this node in its ignore node list, it may get stuck
forever. To fix this, the patch also introduces "liveness aware"
request queue management: we do not simply choose _a_ request to run next,
but go over the queue and find requests that can proceed given
the current node liveness. If multiple requests are eligible to
run, the patch orders them by operation type: replace,
join, remove, leave, rebuild. The order is chosen so as not to trigger
cleanup needlessly.
2024-01-14 15:45:50 +02:00
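The "liveness aware" queue described above can be modeled in a few lines. The sketch below is an illustration only (the request/op_type names and the "required nodes must be live" eligibility rule are assumptions, not the actual Scylla data structures): eligible requests are those whose relevant nodes are all live, and ties are broken by the stated operation order, replace before join before remove, leave, rebuild.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Operation types in the priority order given in the commit message:
// lower value means it is preferred when several requests are eligible.
enum class op_type { replace = 0, join = 1, remove = 2, leave = 3, rebuild = 4 };

struct request {
    op_type type;
    std::set<std::string> required_nodes; // nodes that must be live for this op
};

// Scan the whole queue (not just its head) and pick the best eligible
// request given current liveness; returns -1 if nothing can proceed.
int pick_next(const std::vector<request>& queue, const std::set<std::string>& live) {
    int best = -1;
    for (size_t i = 0; i < queue.size(); ++i) {
        bool eligible = std::all_of(
            queue[i].required_nodes.begin(), queue[i].required_nodes.end(),
            [&](const std::string& n) { return live.count(n) > 0; });
        if (!eligible) continue;
        if (best == -1 || queue[i].type < queue[best].type) best = int(i);
    }
    return best;
}
```

The key point is that a request blocked by a dead node no longer blocks the whole queue behind it.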
Gleb Natapov
c9b7bd5a33 storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter
Needed by the next patch.
2024-01-14 14:44:07 +02:00
Gleb Natapov
067267ff76 storage_service: topology coordinator: make topology coordinator lifecycle subscriber
We want to change the coordinator to consider node liveness when
processing the topology operation queue. If there are not enough live
nodes to process any of the ops, we want to cancel them. For that to work
we need to be able to kick the coordinator when the liveness situation
changes.
2024-01-14 14:44:07 +02:00
Gleb Natapov
f70c4127c6 storage_service: topology coordinator: introduce sstable cleanup fiber
Introduce a fiber that waits on a topology event and, when it sees that
the node it runs on needs to perform sstable cleanup, initiates one
for each non-tablet, non-local table and resets the "cleanup" flag back
to "clean" in the topology.
2024-01-14 14:44:07 +02:00
Gleb Natapov
5b246920ae storage_proxy: allow to wait for all ongoing writes
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. To do so, add a phased_barrier that is
entered on each local write operation before the fence is checked. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non-fenced local writes to complete before
starting a cleanup.
2024-01-14 14:44:07 +02:00
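A minimal single-threaded model may make the phased_barrier idea concrete (this is a toy sketch of the concept behind Seastar's phased barrier, not its actual API): each write enters the current phase, advancing the fence opens a new phase, and the fence-advancer waits until all older phases drain.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model: a "phase" counts in-flight writes that entered before the
// last advance. Real Seastar code is asynchronous; here we only track
// the counters to show the invariant.
struct phased_barrier_model {
    struct phase { int pending = 0; };
    std::vector<std::shared_ptr<phase>> phases{std::make_shared<phase>()};

    // A write enters the current phase; the returned token must be
    // released when the write completes.
    std::shared_ptr<phase> enter() {
        auto p = phases.back();
        p->pending++;
        return p;
    }
    static void leave(const std::shared_ptr<phase>& p) { p->pending--; }

    // Advance the fence: writes started from now on join a fresh phase.
    void advance() { phases.push_back(std::make_shared<phase>()); }

    // True once every write that started before the last advance finished.
    bool old_phases_drained() const {
        for (size_t i = 0; i + 1 < phases.size(); ++i) {
            if (phases[i]->pending > 0) return false;
        }
        return true;
    }
};
```

This captures the guarantee in the commit message: a write is either tracked (entered a pre-fence phase the barrier waits on) or fenced (it starts in the post-advance phase).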
Gleb Natapov
b2ba77978c storage_service: topology coordinator: mark nodes as needing cleanup when required
A cleanup needs to run when a node loses ownership of a range (during
bootstrap) or if a range movement to a normal node failed (removenode,
decommission failure). Mark all dirty nodes as "cleanup needed" in those cases.
2024-01-14 14:43:59 +02:00
Gleb Natapov
dbededb1a6 storage_service: add mark_nodes_as_cleanup_needed function
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after
a successful or unsuccessful topology operation.
2024-01-14 14:43:33 +02:00
Gleb Natapov
cc54796e23 raft topology: add cleanup state to the topology state machine
The patch adds the cleanup state to the persistent and in-memory state and
handles its loading. The state can be "clean", which means no cleanup is
needed; "needed", which means the node is dirty and needs to run cleanup
at some point; or "running", which means cleanup is currently running on the
node, and the state will be reset to "clean" when it completes.
2024-01-14 13:30:54 +02:00
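The three states and their legal transitions form a small cycle. The sketch below encodes them explicitly (the enum and event names are illustrative, not the actual Scylla identifiers):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// clean -> needed (node marked dirty by a topology operation)
// needed -> running (the node's cleanup fiber starts the cleanup)
// running -> clean  (cleanup completed, flag reset in the topology)
enum class cleanup_state { clean, needed, running };

cleanup_state transition(cleanup_state s, const std::string& event) {
    if (s == cleanup_state::clean && event == "mark_dirty") return cleanup_state::needed;
    if (s == cleanup_state::needed && event == "start_cleanup") return cleanup_state::running;
    if (s == cleanup_state::running && event == "cleanup_done") return cleanup_state::clean;
    throw std::logic_error("illegal cleanup transition");
}
```

Any other combination is rejected, which mirrors the intent that the persistent state only ever moves through this cycle.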
Kamil Braun
4e18f8b453 Merge 'topology_state_load: stop waiting for IP-s' from Petr Gusev
The loop in the `id2ip` lambda causes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and get stuck in the loop forever. In this series we
replace the loop with an if - we just don't update the `host_id <-> ip`
mapping in the `token_metadata.topology` if we don't have an `IP` yet.

The PR moves `host_id -> IP` resolution to the data plane, now it
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat missing IP for `host_id` as `internal_error`, but we plan to relax
this in the follow-up (see this PR's first comment).

Closes scylladb/scylladb#16639

* github.com:scylladb/scylladb:
  raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
  gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
  storage_service: topology_state_load: remove IP waiting loop
  storage_service: sync_raft_topology_nodes: add target_node parameter
  storage_service: sync_raft_topology_nodes: move loops to the end
  storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node
  storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
  storage_service: sync_raft_topology_nodes: move update_topology up
  storage_service: topology_state_load: remove clone_async/clear_gently overhead
  storage_service: fix indentation
  storage_service: extract sync_raft_topology_nodes
  storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
  address_map: move gossiper subscription logic into storage_service
  topology_coordinator: exec_global_command: small refactor, use contains + reformat
  storage_service: wait_for_ip for new nodes
  storage_service.idl.hh: fix raft_topology_cmd.command declaration
  erm: for_each_natural_endpoint_until: use is_vnode == true
  erm: switch the internal data structures to host_id-s
  erm: has_pending_ranges: switch to host_id
2024-01-12 18:46:51 +01:00
Petr Gusev
e24bee545b raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater 2024-01-12 18:29:22 +04:00
Petr Gusev
6e7bbc94f4 gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.

The test_change_ip was improved to verify this new
behaviour. Before this patch the test didn't check
that IPs used for data requests are updated on
IP change. In this commit we add a read/write check;
without the fix, it fails on insert with a
'node unavailable' error.
2024-01-12 18:28:57 +04:00
Petr Gusev
6d6e1ba8fb storage_service: topology_state_load: remove IP waiting loop
The loop causes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and get stuck in the loop forever.

The idea of the patch is to replace the loop with an
if - we just don't update the host_id <-> ip mapping
in the token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
2024-01-12 15:37:50 +04:00
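The shape of this change, replacing a potentially unbounded wait with a conditional update plus a later retry on notification, can be sketched as follows (types and function names are stand-ins, not the real raft_address_map API):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

using host_id = std::string;
using ip_addr = std::string;

// Stand-in for a raft_address_map lookup: the IP may simply not be
// known yet (or ever, for long-gone nodes).
std::optional<ip_addr> lookup(const std::map<host_id, ip_addr>& addr_map,
                              const host_id& id) {
    auto it = addr_map.find(id);
    if (it == addr_map.end()) return std::nullopt;
    return it->second;
}

// Returns true if the host_id <-> ip mapping was updated now. Returning
// false means "skip for now": when the IP arrives later, the gossiper
// notification path calls the sync again, so nothing blocks.
bool maybe_update_mapping(std::map<host_id, ip_addr>& topology,
                          const std::map<host_id, ip_addr>& addr_map,
                          const host_id& id) {
    auto ip = lookup(addr_map, id);
    if (!ip) return false; // no waiting loop: just don't update yet
    topology[id] = *ip;
    return true;
}
```

The correctness argument is the same as in the commit message: missing the update is safe because the mapping is re-synced from the notification handler once the IP becomes known.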
Petr Gusev
260874c860 storage_service: sync_raft_topology_nodes: add target_node parameter
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.
2024-01-12 15:37:50 +04:00
Petr Gusev
a9d58c3db5 storage_service: sync_raft_topology_nodes: move loops to the end 2024-01-12 15:37:50 +04:00
Petr Gusev
d1bce3651b storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node 2024-01-12 15:37:50 +04:00
Petr Gusev
aa37b6cfd3 storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node 2024-01-12 15:37:50 +04:00
Petr Gusev
a508d7ffc5 storage_service: sync_raft_topology_nodes: move update_topology up
In this and the following commits we prepare sync_raft_topology_nodes
to handle target_node parameter - the single host_id which should be
updated.
2024-01-12 15:37:50 +04:00
Petr Gusev
1b12f4b292 storage_service: topology_state_load: remove clone_async/clear_gently overhead
Before the patch we used to clone the entire token_metadata
and topology only to immediately drop everything in clear_gently.
This is a sheer waste.
2024-01-12 15:37:50 +04:00
Petr Gusev
1531e5e063 storage_service: fix indentation 2024-01-12 15:37:50 +04:00
Petr Gusev
9c50637f28 storage_service: extract sync_raft_topology_nodes
In the following commits we need part of the
topology_state_load logic to be applied
from gossiper_state_change_subscriber_proxy.
In this commit we extract this logic into a
new function sync_raft_topology_nodes.
2024-01-12 15:37:50 +04:00
Petr Gusev
9679b49cf4 storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
In the next commit we extract the per-node loops into
a new function; in this commit we just move them
closer to each other.

Now the remove_endpoint function might be called under
token_metadata_lock (mutate_token_metadata takes it).
It's not a problem since gossiper event handlers in
raft_topology mode don't modify token_metadata, so
we won't get a deadlock.
2024-01-12 15:37:50 +04:00
Petr Gusev
15b8e565ed address_map: move gossiper subscription logic into storage_service
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.

In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.
2024-01-12 15:37:50 +04:00
Petr Gusev
743be190f9 topology_coordinator: exec_global_command: small refactor, use contains + reformat 2024-01-12 15:37:50 +04:00
Petr Gusev
db1f0d5889 storage_service: wait_for_ip for new nodes
When a new node joins the cluster we need to be sure that its IP
is known to all other nodes. In this patch we do this by waiting
for the IP to appear in raft_address_map.

A new raft_topology_cmd::command::wait_for_ip command is added.
It's run on all nodes of the cluster before we put the topology
into transition state. This applies both to new and replacing nodes.
It's important to run wait_for_ip before moving to
topology::transition_state::join_group0 since in this state
node IPs are already used to populate pending nodes in erm.
2024-01-12 15:37:46 +04:00
Botond Dénes
5f44ae8371 Merge 'Add more logging for gossiper::lock_endpoint and storage_service::handle_state_normal' from Kamil Braun
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.

Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether `handle_state_normal` was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
`gossiper::lock_endpoint` by reporting an error message if `lock_endpoint`
waits for the lock for too long.

Ref: scylladb/scylladb#16668

Closes scylladb/scylladb#16733

* github.com:scylladb/scylladb:
  gossiper: report error when waiting too long for endpoint lock
  gossiper: store source_location instead of string in endpoint_permit
  storage_service: more verbose logging in handle_state_normal
2024-01-12 10:51:21 +02:00
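The lock-wait diagnostic described in the merge can be modeled very simply: the permit remembers where the lock was taken (the real code stores a source_location), and a waiter that exceeds its budget reports an error naming the holder instead of blocking silently. This is a toy illustration; all names and the time handling are assumptions.

```cpp
#include <cassert>
#include <string>

// Toy model of an endpoint lock whose holder records its acquisition
// site, so a stuck waiter can report who is blocking it.
struct endpoint_lock_model {
    bool held = false;
    std::string holder_location; // where the current holder acquired it
};

std::string acquire(endpoint_lock_model& l, const std::string& where,
                    int waited_ms, int limit_ms) {
    if (!l.held) {
        l.held = true;
        l.holder_location = where;
        return "acquired";
    }
    if (waited_ms >= limit_ms) {
        // The real patch logs an error here, surfacing possible deadlocks
        // in gossiper::lock_endpoint instead of hanging silently.
        return "error: lock_endpoint waited too long; held by " + l.holder_location;
    }
    return "still waiting";
}
```

The value of recording the holder's source location is visible in the error string: it tells you which code path to inspect for the deadlock.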
Petr Gusev
1928dc73a8 erm: has_pending_ranges: switch to host_id
In the next patches we are going to change erm data structures
(replication_map and ring_mapping) from IP to host_id. Having
locator::host_id instead of IP in has_pending_ranges arguments
makes this transition easier.
2024-01-12 12:23:19 +04:00
Benny Halevy
3e938dbb5a storage_service: get rid of handle_state_moving declaration
The implementation was already removed in e64613154f

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#16742
2024-01-12 09:38:23 +02:00
Kamil Braun
664349a10f storage_service: more verbose logging in handle_state_normal
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.

Improve the INFO level logging in handle_state_normal to aid debugging
in the future.

The amount of logs is still constant per-node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but it is
a once-per-startup event.
2024-01-10 16:39:55 +01:00
Botond Dénes
f4f724921c load_meter: get_load_map(): don't unconditionally dereference _lb
Said method checks that `_lb` is not null before accessing it.
However, since 0e5754a, there was an unconditional access, adding an
entry for the local node. Move this inside the if, so it is covered by
the null check. The only caller is the API (probably nodetool); the
worst that can happen is that it gets a completely empty load map if
it calls too early during startup.

Fixes: #16617

Closes scylladb/scylladb#16659
2024-01-09 16:02:12 +03:00
Kamil Braun
d4f4b58f3a Merge 'topology_coordinator: reject removenode if the removed node is alive' from Patryk Jędrzejczak
The removenode operation is defined to succeed only if the node
being removed is dead. Currently, we reject this operation on the
initiator side (in `storage_service::raft_removenode`) when the
failure detector considers the node being removed alive. However,
it is possible that even if the initiator considers the node dead,
the topology coordinator will consider it alive when handling the
topology request. For example, the topology coordinator can use
a bigger failure detector timeout, or the node being removed can
suddenly resurrect.

This PR makes the topology coordinator reject removenode if the
node being removed is considered alive. It also adds
`test_remove_alive_node` that verifies this change.

Fixes scylladb/scylladb#16109

Closes scylladb/scylladb#16584

* github.com:scylladb/scylladb:
  test: add test_remove_alive_node
  topology_coordinator: reject removenode if the removed node is alive
  test: ManagerClient: remove unused wait_for_host_down
  test: remove_node: wait until the node being removed is dead
2024-01-08 12:39:23 +01:00
Kamil Braun
d11e824802 Merge 'storage_service: make all Raft-based operations abortable' from Patryk Jędrzejczak
During a shutdown, we call `storage_service::stop_transport` first.
We may try to apply a Raft command after that, or still be in the
process of applying a command. In such a case, the shutdown
process will hang because Raft retries replicating a command until
it succeeds even in the case of a network error. It will stop when
a corresponding abort source is set. However, if we pass `nullptr`
to a function like `add_entry`, it won't stop. The shutdown
process will hang forever.

We fix all places that incorrectly pass `nullptr`. These shutdown
hangs are not only theoretical. The incorrect `add_entry` call in
`update_topology_state` caused scylladb/scylladb#16435.

Additionally, we remove the default `nullptr` values in all member
functions of `server` and `raft_group0_client` to avoid similar bugs
in the future.

Fixes scylladb/scylladb#16435

Closes scylladb/scylladb#16663

* github.com:scylladb/scylladb:
  server, raft_group0_client: remove the default nullptr values
  storage_service: make all Raft-based operations abortable
2024-01-08 11:30:56 +01:00
Patryk Jędrzejczak
df2034ebd7 server, raft_group0_client: remove the default nullptr values
The previous commit has fixed 5 bugs of the same type - incorrectly
passing the default nullptr to one of the changed functions. At
least some of these bugs wouldn't appear if there was no default
value. It's much harder to make this kind of bug if you have to
write "nullptr". It's also much easier to detect in review.

Moreover, these default values are rarely used outside tests.
Keeping them is just not worth the time spent on debugging.
2024-01-05 18:45:50 +01:00
Patryk Jędrzejczak
3d4af4ecf1 storage_service: make all Raft-based operations abortable
During a shutdown, we call `storage_service::stop_transport` first.
We may try to apply a Raft command after that, or still be in the
process of applying a command. In such a case, the shutdown
process will hang because Raft retries replicating a command until
it succeeds even in the case of a network error. It will stop when
a corresponding abort source is set. However, if we pass `nullptr`
to a function like `add_entry`, it won't stop. The shutdown
process will hang forever.

We fix all places that incorrectly pass `nullptr`. These shutdown
hangs are not only theoretical. The incorrect `add_entry` call in
`update_topology_state` caused scylladb/scylladb#16435.
2024-01-05 18:45:20 +01:00
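The essence of the fix, a retry loop that consults a mandatory abort source instead of an optional (and easily forgotten) nullptr, can be sketched like this. The types are simplified stand-ins for the real Raft server API:

```cpp
#include <cassert>
#include <stdexcept>

// Simplified stand-in for an abort source: shutdown sets `requested`,
// and any retry loop consulting it stops with an exception.
struct abort_source {
    bool requested = false;
    void check() const {
        if (requested) throw std::runtime_error("aborted");
    }
};

// Before the fix: add_entry(cmd, abort_source* as = nullptr) - passing
// nullptr meant the retry loop could never be interrupted, hanging
// shutdown forever. After: the reference parameter forces every caller
// to wire in an abort source.
int add_entry_with_retries(abort_source& as, int failures_before_success) {
    int attempts = 0;
    for (;;) {
        as.check(); // stop retrying once shutdown begins
        ++attempts;
        if (attempts > failures_before_success) return attempts;
    }
}
```

With a reference (no default), the compiler rejects any call site that forgets the abort source, which is exactly the class of bug the follow-up commit removes.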
Kefu Chai
7e84e03f52 gms: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

because the removal of `#include "unimplemented.hh"`,
`service/migration_manager.cc` misses the definition of
`unimplemented::cause::VALIDATION`, so include the header where it is
used.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16654
2024-01-05 13:37:08 +02:00
Kamil Braun
bf068dd023 Merge handle error in cdc generation propagation during bootstrap from Gleb
Bootstrap cannot proceed if cdc generation propagation to all nodes
fails, so the patch series handles the error by rolling the ongoing
topology operation back.

* 'gleb/raft-cdc-failure' of github.com:scylladb/scylla-dev:
  test: add test to check failure handling in cdc generation commit
  storage_service: topology coordinator: rollback on failure to commit cdc generation
2024-01-04 15:38:51 +01:00
Kamil Braun
f942bf4a1f Merge 'Do not update endpoint state via gossiper::add_saved_endpoint once it was updated via gossip' from Benny Halevy
Currently, `add_saved_endpoint` is called from two paths:  One, is when
loading states from system.peers in the join path (join_cluster,
join_token_ring), when `_raft_topology_change_enabled` is false, and the
other is from `storage_service::topology_state_load` when raft topology
changes are enabled.

In the latter path, from `topology_state_load`, `add_saved_endpoint` is
called only if the endpoint_state does not exist yet.  However, this is
checked without acquiring the endpoint_lock and so it races with the
gossiper, and once `add_saved_endpoint` acquires the lock, the endpoint
state may already be populated.

Since `add_saved_endpoint` applies local information about the endpoint
state (e.g. tokens, dc, rack), it uses the local heart_beat_version,
with generation=0 to update the endpoint states, and that is
incompatible with changes applied via gossip that will carry the
endpoint's generation and version, determining the state's update order.

This change makes sure that the endpoint state is never updated in
`add_saved_endpoint` if it has non-zero generation.  An internal error
exception is thrown if non-zero generation is found, and in the only
call site that might reach that state, in
`storage_service::topology_state_load`, the caller acquires the
endpoint_lock for checking for the existence of the endpoint_state,
calling `add_saved_endpoint` under the lock only if the endpoint_state
does not exist.

Fixes #16429

Closes scylladb/scylladb#16432

* github.com:scylladb/scylladb:
  gossiper: add_saved_endpoint: keep heart_beat_state if ep_state is found
  storage_service: topology_state_load: lock endpoint for add_saved_endpoint
  raft_group_registry: move on_alive error injection to gossiper
2024-01-04 14:47:10 +01:00
Kefu Chai
34259a03d0 treewide: use consteval string as format string when formatting log message
seastar::logger uses compile-time format checking by default if
compiled with {fmt} 8.0 and up, and it requires the format string to be
a consteval string; otherwise we have to use `fmt::runtime()` explicitly.

to adapt to this change, let's use consteval strings when formatting
log messages.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16612
2024-01-02 19:08:47 +02:00
Kamil Braun
949658590f Merge 'raft topology: do not update token metadata in on_alive and on_remove' from Patryk Jędrzejczak
In the Raft-based topology, we should never update token metadata
through gossip notifications. `storage_service::on_alive` and
`storage_service::on_remove` do it, so we ignore their parts that
touch token metadata.

Additionally, we improve some logs in other places where we ignore
the function because of using the Raft-based topology.

Fixes scylladb/scylladb#15732

Closes scylladb/scylladb#16528

* github.com:scylladb/scylladb:
  storage_service: handle_state_left, handle_state_normal: improve logs
  raft topology: do not update token metadata in on_alive and on_remove
2024-01-02 16:08:50 +01:00
Benny Halevy
0e5a666e6f storage_service: drop do_update_system_peers_table
It is no longer used after previous patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
13d395fa6a storage_service: on_change: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
ad8a9104d8 endpoint_state subscriptions: batch on_change notification
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.

In particular, this allows storage_service::on_change
to update_peer_info once for all changed states.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
1d07a596bf everywhere: drop before_change subscription
None of the subscribers is doing anything before_change.
This is done before changing `on_change` in the following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
2075c85b70 storage_service: seal peer info with host_id
When adding a peer via update_peer_info,
insert all columns in a single query
using system_keyspace::peer_info.
This ensures that `host_id` is inserted along with all
other app states, so we can rely on it
when loading the peer info after restart.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
eb4cd388ce storage_service: update_peer_info: pass peer_info to sys_ks
Use the newly added system_keyspace::peer_info
to pass a struct of all optional system.peers members
to system_keyspace::update_peer_info.

Add `get_peer_info_for_update` to construct said struct
from the endpoint state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
5abf556399 gms: endpoint_state: define application_state_map
Have a central definition for the map held
in the endpoint_state (before changing it to
std::unordered_map).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 18:37:34 +02:00
Benny Halevy
3099c5b8ab storage_service: topology_state_load: lock endpoint for add_saved_endpoint
`topology_state_load` currently calls `add_saved_endpoint`
only if it finds no endpoint_state_ptr for the endpoint.
However, this is done before locking the endpoint
and the endpoint state could be inserted concurrently.

To prevent that, a permit_id parameter was added to
`add_saved_endpoint` allowing the caller to call it
while the endpoint is locked.  With that, `topology_state_load`
locks the endpoint and checks the existence of the endpoint state
under the lock, before calling `add_saved_endpoint`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 16:48:57 +02:00
Benny Halevy
db434e8cb5 raft_group_registry: move on_alive error injection to gossiper
Move the `raft_group_registry::on_alive` error injection point
to `gossiper::real_mark_alive` so it can delay marking the endpoint as
alive, and calling the `on_alive` callback, but without holding
the endpoint_lock.

Note that the entry for this endpoint in `_pending_mark_alive_endpoints`
still blocks marking it as alive until real_mark_alive completes.

Fixes #16506

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-31 15:28:54 +02:00