Introduce a new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
a cluster-wide cleanup on all dirty nodes. This is done by introducing a new
global command "global_topology_request::cleanup".
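For illustration only, a minimal sketch of how the new global command could be shaped; the surrounding enumerator set and the handler wiring are assumptions, not the actual code:

```cpp
#include <cstdint>

// Hypothetical sketch: a new global request kind that the
// /storage_service/cleanup_all REST handler enqueues for the topology
// coordinator to execute cluster-wide.
enum class global_topology_request : uint16_t {
    new_cdc_generation,   // existing kind (assumed)
    cleanup,              // added: run sstable cleanup on all dirty nodes
};
```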
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects the situation where the topology
operation to be executed cannot run safely until all dirty nodes have done
cleanup, and initiates the cleanup automatically. It also waits for the
cleanup to complete before proceeding with the topology operation.
There can be a situation where a node that needs cleanup dies and never
clears the flag. In this case, if the topology operation that wants to
run next does not have this node in its ignore node list, it may get stuck
forever. To fix this, the patch also introduces "liveness aware"
request queue management: we do not simply choose _a_ request to run next,
but go over the queue and find requests that can proceed given the
current node liveness. If multiple requests are eligible to run, the
patch orders them by operation type: replace, join, remove, leave,
rebuild. The order is chosen so as not to trigger cleanup needlessly.
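A minimal, self-contained sketch of the liveness-aware selection described above; the types (request, node_is_alive, op_kind) are illustrative and not the coordinator's real interfaces:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

enum class op_kind { replace, join, remove, leave, rebuild };  // priority order

struct request {
    op_kind kind;
    std::vector<int> required_nodes;   // nodes that must be alive (or ignored)
};

bool node_is_alive(int id);            // provided by the failure detector

// Scan the whole queue instead of blindly taking its head, and pick the
// runnable request with the highest-priority operation kind.
std::optional<request> pick_next(const std::vector<request>& queue) {
    std::optional<request> best;
    for (const auto& r : queue) {
        bool runnable = std::all_of(r.required_nodes.begin(), r.required_nodes.end(),
                                    node_is_alive);
        if (runnable && (!best || r.kind < best->kind)) {
            best = r;
        }
    }
    return best;   // nullopt: nothing can run until the liveness situation changes
}
```

Scanning the whole queue instead of only its head is what prevents a dead, never-cleaned node from blocking unrelated requests.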
We want to change the coordinator to consider node liveness when
processing the topology operation queue. If there are not enough live
nodes to process any of the ops, we want to cancel them. For that to work
we need to be able to kick the coordinator when the liveness situation
changes.
Introduce a fiber that waits on a topology event and, when it sees that
the node it runs on needs to perform sstable cleanup, initiates one
for each non-tablet, non-local table and resets the "cleanup" flag back to
"clean" in the topology.
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. To do so, add a phased_barrier that is
entered on each local write operation before the fence is checked. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non-fenced local writes to complete before
starting a cleanup.
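The ordering matters: a write must become visible to the barrier before it checks the fence, otherwise it could slip past both. A self-contained std-based sketch of the pattern (the real code uses utils::phased_barrier and seastar futures, not this analogue):

```cpp
#include <condition_variable>
#include <mutex>
#include <stdexcept>

class write_barrier {
    std::mutex _mx;
    std::condition_variable _cv;
    long _pending = 0;
public:
    class operation {                 // RAII handle for one tracked write
        write_barrier* _b;
    public:
        explicit operation(write_barrier& b) : _b(&b) {
            std::lock_guard<std::mutex> lk(b._mx);
            ++b._pending;
        }
        operation(const operation&) = delete;
        operation& operator=(const operation&) = delete;
        ~operation() {
            std::lock_guard<std::mutex> lk(_b->_mx);
            if (--_b->_pending == 0) {
                _b->_cv.notify_all();
            }
        }
    };
    operation start() { return operation(*this); }
    void await_pending() {            // wait for all tracked writes to drain
        std::unique_lock<std::mutex> lk(_mx);
        _cv.wait(lk, [&] { return _pending == 0; });
    }
};

void do_local_write(write_barrier& wb, long write_version, long fence_version) {
    auto op = wb.start();                  // 1. become visible to the barrier
    if (write_version < fence_version) {   // 2. only then check the fence
        throw std::runtime_error("write is fenced");
    }
    // ... apply the write locally ...
}

void wait_before_cleanup(write_barrier& wb) {
    // advance the fence first (not shown), then:
    wb.await_pending();                    // all non-fenced local writes have completed
}
```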
A cleanup needs to run when a node loses ownership of a range (during
bootstrap) or if a range movement to a normal node fails (removenode or
decommission failure). Mark all dirty nodes as "cleanup needed" in those cases.
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after a
successful or unsuccessful topology operation.
The patch adds the cleanup state to the persistent and in-memory state and
handles loading it. The state can be "clean", which means no cleanup is
needed; "needed", which means the node is dirty and needs to run cleanup
at some point; or "running", which means that cleanup is running on the node
right now and the state will be reset to "clean" once it completes.
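For illustration, a small sketch of what the state and its textual representation could look like; the exact names and serialization are assumptions:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string_view>

enum class cleanup_status : uint8_t {
    clean,    // no cleanup needed
    needed,   // the node is dirty and must run cleanup at some point
    running,  // cleanup is in progress; reset to clean on completion
};

constexpr std::string_view to_string(cleanup_status s) {
    switch (s) {
    case cleanup_status::clean:   return "clean";
    case cleanup_status::needed:  return "needed";
    case cleanup_status::running: return "running";
    }
    throw std::invalid_argument("unknown cleanup_status");
}

constexpr cleanup_status cleanup_status_from_string(std::string_view s) {
    if (s == "clean")   return cleanup_status::clean;
    if (s == "needed")  return cleanup_status::needed;
    if (s == "running") return cleanup_status::running;
    throw std::invalid_argument("unknown cleanup_status");
}
```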
The loop in the `id2ip` lambda causes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and get stuck in the loop forever. In this series we
replace the loop with an if: we just don't update the `host_id <-> ip`
mapping in `token_metadata.topology` if we don't have an `IP` yet.
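An illustrative, self-contained sketch of the shape of the change; all names here are placeholders rather than the real interfaces:

```cpp
#include <unordered_map>

using host_id = long;
using ip_addr = unsigned;
using address_map_t = std::unordered_map<host_id, ip_addr>;

struct topology_view {
    std::unordered_map<host_id, ip_addr> host_to_ip;
    void set_mapping(host_id h, ip_addr ip) { host_to_ip[h] = ip; }
};

// After the change: if the IP is not known yet, just skip the update; the
// mapping will be applied later, when the gossiper notification delivers
// the IP and the sync is run again.
void maybe_update_mapping(const address_map_t& address_map, topology_view& topo, host_id h) {
    if (auto it = address_map.find(h); it != address_map.end()) {
        topo.set_mapping(h, it->second);
    }
    // else: long-gone or not-yet-seen node; do not wait here.
}
```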
The PR moves `host_id -> IP` resolution to the data plane; it now
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of a `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat a missing IP for a `host_id` as an `internal_error`, but we plan to
relax this in a follow-up (see the first comment of this PR).
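A simplified sketch of the idea, with placeholder types standing in for erm and raft_address_map; the point is that the lookup happens on every IP-based call instead of once at erm construction:

```cpp
#include <optional>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using host_id = long;
using ip_addr = unsigned;

struct address_map {
    std::unordered_map<host_id, ip_addr> m;
    std::optional<ip_addr> find(host_id h) const {
        auto it = m.find(h);
        if (it == m.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};

class erm {
    std::vector<host_id> _replicas;        // internal structures now keep host_id
    const address_map& _address_map;
public:
    erm(std::vector<host_id> replicas, const address_map& am)
        : _replicas(std::move(replicas)), _address_map(am) {}

    // IP-based accessor: the lookup runs on every call (data plane), so IPs
    // learned after the erm was built are still picked up.
    std::vector<ip_addr> natural_endpoints() const {
        std::vector<ip_addr> out;
        for (auto h : _replicas) {
            auto ip = _address_map.find(h);
            if (!ip) {
                throw std::runtime_error("internal error: no IP for host_id"); // to be relaxed later
            }
            out.push_back(*ip);
        }
        return out;
    }
};
```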
Closes scylladb/scylladb#16639
* github.com:scylladb/scylladb:
raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
storage_service: topology_state_load: remove IP waiting loop
storage_service: sync_raft_topology_nodes: add target_node parameter
storage_service: sync_raft_topology_nodes: move loops to the end
storage_service: sync_raft_topology_nodes: extract process_left_node and process_transition_node
storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
storage_service: sync_raft_topology_nodes: move update_topology up
storage_service: topology_state_load: remove clone_async/clear_gently overhead
storage_service: fix indentation
storage_service: extract sync_raft_topology_nodes
storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
address_map: move gossiper subscription logic into storage_service
topology_coordinator: exec_global_command: small refactor, use contains + reformat
storage_service: wait_for_ip for new nodes
storage_service.idl.hh: fix raft_topology_cmd.command declaration
erm: for_each_natural_endpoint_until: use is_vnode == true
erm: switch the internal data structures to host_id-s
erm: has_pending_ranges: switch to host_id
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.
The test_change_ip test was improved to verify this new
behaviour. Before this patch the test didn't check
that the IPs used for data requests are updated on
IP change. In this commit we add a read/write check.
Without the fix it fails on insert with a 'node
unavailable' error.
The loop causes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and get stuck in the loop forever.
The idea of the patch is to replace the loop with an
if: we just don't update the host_id <-> ip mapping
in token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.
In the following commits we need part of the
topology_state_load logic to be applied
from gossiper_state_change_subscriber_proxy.
In this commit we extract this logic into a
new function sync_raft_topology_nodes.
In the next commit we extract the loops over nodes into
a new function; in this commit we just move them
closer to each other.
Now the remove_endpoint function might be called under
the token_metadata_lock (mutate_token_metadata takes it).
It's not a problem since gossiper event handlers in
raft_topology mode don't modify token_metadata, so
we won't get a deadlock.
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.
In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.
When a new node joins the cluster we need to be sure that its IP
is known to all other nodes. In this patch we do this by waiting
for the IP to appear in raft_address_map.
A new raft_topology_cmd::command::wait_for_ip command is added.
It's run on all nodes of the cluster before we put the topology
into transition state. This applies both to new and replacing nodes.
It's important to run wait_for_ip before moving to
topology::transition_state::join_group0 since in this state
node IPs are already used to populate pending nodes in erm.
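A simplified sketch of what handling wait_for_ip on a node amounts to; the polling loop and the names are illustrative (the real code waits on notifications and an abort source rather than sleeping):

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <stdexcept>
#include <thread>

using ip_addr = unsigned;

// lookup: returns the joining node's IP once the local raft_address_map learns it.
ip_addr wait_for_ip(std::function<std::optional<ip_addr>()> lookup, const bool& abort_requested) {
    using namespace std::chrono_literals;
    while (!abort_requested) {
        if (auto ip = lookup()) {
            return *ip;    // this node now knows the new node's IP
        }
        std::this_thread::sleep_for(100ms);   // sketch only: the real code awaits a notification
    }
    throw std::runtime_error("aborted while waiting for the new node's IP");
}
```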
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.
Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether `handle_state_normal` was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
`gossiper::lock_endpoint` by reporting an error message if `lock_endpoint`
waits for the lock for too long.
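A rough, self-contained sketch of the kind of diagnostic added to lock_endpoint; the timed_mutex, the 2-minute threshold and the logging below are illustrative stand-ins, not the gossiper's real locking code:

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <source_location>

std::timed_mutex endpoint_lock;   // stand-in for the per-endpoint lock

void lock_endpoint_with_diagnostics(const char* endpoint,
                                    std::source_location caller = std::source_location::current()) {
    using namespace std::chrono_literals;
    while (!endpoint_lock.try_lock_for(2min)) {
        // Likely deadlock or a stuck holder: report and keep waiting, so the
        // only behavioural change is the extra error message.
        std::fprintf(stderr, "%s:%u: waited over 2 minutes for lock on endpoint %s\n",
                     caller.file_name(), (unsigned)caller.line(), endpoint);
    }
    // lock acquired; the caller releases it when done (sketch only)
}
```

Storing a source_location instead of a string in the permit makes it cheap to report who is holding or requesting the lock.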
Ref: scylladb/scylladb#16668
Closes scylladb/scylladb#16733
* github.com:scylladb/scylladb:
gossiper: report error when waiting too long for endpoint lock
gossiper: store source_location instead of string in endpoint_permit
storage_service: more verbose logging in handle_state_normal
In the next patches we are going to change erm data structures
(replication_map and ring_mapping) from IP to host_id. Having
locator::host_id instead of IP in has_pending_ranges arguments
makes this transition easier.
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.
Improve the INFO level logging in handle_state_normal to aid debugging
in the future.
The amount of logs is still constant per-node. Even though some log
messages report all tokens owned by a node, handle_state_normal calls
are still rare. The most "spammy" situation is when a node starts and
calls handle_state_normal for every other node in the cluster, but it is
a once-per-startup event.
Said method has a check on `_lb` not being null before accessing it.
However, since 0e5754a, there was an unconditional access, adding an
entry for the local node. Move this inside the if, so it is covered by
the null check. The only caller is the API (probably nodetool); the
worst that can happen is that it gets a completely empty load map if
it calls too early during startup.
Fixes: #16617
Closes scylladb/scylladb#16659
The removenode operation is defined to succeed only if the node
being removed is dead. Currently, we reject this operation on the
initiator side (in `storage_service::raft_removenode`) when the
failure detector considers the node being removed alive. However,
it is possible that even if the initiator considers the node dead,
the topology coordinator will consider it alive when handling the
topology request. For example, the topology coordinator can use
a bigger failure detector timeout, or the node being removed can
suddenly resurrect.
This PR makes the topology coordinator reject removenode if the
node being removed is considered alive. It also adds
`test_remove_alive_node` that verifies this change.
Fixes scylladb/scylladb#16109
Closes scylladb/scylladb#16584
* github.com:scylladb/scylladb:
test: add test_remove_alive_node
topology_coordinator: reject removenode if the removed node is alive
test: ManagerClient: remove unused wait_for_host_down
test: remove_node: wait until the node being removed is dead
During a shutdown, we call `storage_service::stop_transport` first.
We may try to apply a Raft command after that, or still be in
the process of applying a command. In such a case, the shutdown
process will hang because Raft retries replicating a command until
it succeeds even in the case of a network error. It will stop when
a corresponding abort source is set. However, if we pass `nullptr`
to a function like `add_entry`, it won't stop. The shutdown
process will hang forever.
We fix all places that incorrectly pass `nullptr`. These shutdown
hangs are not only theoretical. The incorrect `add_entry` call in
`update_topology_state` caused scylladb/scylladb#16435.
Additionally, we remove the default `nullptr` values in all member
functions of `server` and `raft_group0_client` to avoid similar bugs
in the future.
Fixes scylladb/scylladb#16435
Closes scylladb/scylladb#16663
* github.com:scylladb/scylladb:
server, raft_group0_client: remove the default nullptr values
storage_service: make all Raft-based operations abortable
The previous commit has fixed 5 bugs of the same type: incorrectly
passing the default nullptr to one of the changed functions. At
least some of these bugs wouldn't have appeared if there were no default
value. It's much harder to make this kind of bug if you have to
write "nullptr". It's also much easier to detect it in review.
Moreover, these default values are rarely used outside tests.
Keeping them is just not worth the time spent on debugging.
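A minimal illustration of the signature change; raft_server and add_entry here are simplified stand-ins, not the real raft API:

```cpp
#include <cassert>

struct abort_source {};
struct command {};

struct raft_server {
    // Before: void add_entry(command cmd, abort_source* as = nullptr);
    // After: no default argument, so forgetting to pass an abort source is a
    // compile error instead of a potential shutdown hang.
    void add_entry(command cmd, abort_source* as) {
        assert(as != nullptr && "pass the shutdown abort source, not nullptr");
        (void)cmd;
        // ... replicate the entry, giving up once *as is triggered ...
    }
};
```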
During a shutdown, we call `storage_service::stop_transport` first.
We may try to apply a Raft command after that, or still be in
the process of applying a command. In such a case, the shutdown
process will hang because Raft retries replicating a command until
it succeeds even in the case of a network error. It will stop when
a corresponding abort source is set. However, if we pass `nullptr`
to a function like `add_entry`, it won't stop. The shutdown
process will hang forever.
We fix all places that incorrectly pass `nullptr`. These shutdown
hangs are not only theoretical. The incorrect `add_entry` call in
`update_topology_state` caused scylladb/scylladb#16435.
Bootstrap cannot proceed if cdc generation propagation to all nodes
fails, so the patch series handles the error by rolling the ongoing
topology operation back.
* 'gleb/raft-cdc-failure' of github.com:scylladb/scylla-dev:
test: add test to check failure handling in cdc generation commit
storage_service: topology coordinator: rollback on failure to commit cdc generation
Currently, `add_saved_endpoint` is called from two paths: one is when
loading states from system.peers in the join path (join_cluster,
join_token_ring), when `_raft_topology_change_enabled` is false, and the
other is from `storage_service::topology_state_load` when raft topology
changes are enabled.
In the latter path, from `topology_state_load`, `add_saved_endpoint` is
called only if the endpoint_state does not exist yet. However, this is
checked without acquiring the endpoint_lock, so it races with the
gossiper, and once `add_saved_endpoint` acquires the lock, the endpoint
state may already be populated.
Since `add_saved_endpoint` applies local information about the endpoint
state (e.g. tokens, dc, rack), it uses the local heart_beat_version,
with generation=0, to update the endpoint states, and that is
incompatible with changes applied via gossip, which carry the
endpoint's generation and version, determining the state's update order.
This change makes sure that the endpoint state is never updated in
`add_saved_endpoint` if it has a non-zero generation. An internal error
exception is thrown if a non-zero generation is found, and in the only
call site that might reach that state, in
`storage_service::topology_state_load`, the caller acquires the
endpoint_lock for checking the existence of the endpoint_state,
calling `add_saved_endpoint` under the lock only if the endpoint_state
does not exist.
Fixes #16429
Closes scylladb/scylladb#16432
* github.com:scylladb/scylladb:
gossiper: add_saved_endpoint: keep heart_beat_state if ep_state is found
storage_service: topology_state_load: lock endpoint for add_saved_endpoint
raft_group_registry: move on_alive error injection to gossiper
seastar::logger uses compile-time format checking by default when
compiled with {fmt} 8.0 and up, and it requires the format string to be
a consteval string; otherwise we have to use `fmt::runtime()` explicitly.
To adapt to this change, let's use consteval strings when formatting
logging messages.
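For reference, a small standalone {fmt} example of the distinction (plain {fmt}, not seastar code):

```cpp
#include <fmt/core.h>
#include <string>

int main() {
    std::string pattern = "{} + {} = {}\n";       // format string built at run time
    // fmt::print(pattern, 1, 2, 3);              // rejected by C++20 compile-time checking
    fmt::print(fmt::runtime(pattern), 1, 2, 3);   // explicit opt-out: checked at run time
    fmt::print("{} + {} = {}\n", 1, 2, 3);        // literal: checked at compile time
    return 0;
}
```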
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16612
In the Raft-based topology, we should never update token metadata
through gossip notifications. `storage_service::on_alive` and
`storage_service::on_remove` do it, so we ignore their parts that
touch token metadata.
Additionally, we improve some logs in other places where we ignore
the function because of using the Raft-based topology.
Fixes scylladb/scylladb#15732
Closes scylladb/scylladb#16528
* github.com:scylladb/scylladb:
storage_service: handle_state_left, handle_state_normal: improve logs
raft topology: do not update token metadata in on_alive and on_remove
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, this allows storage_service::on_change
to call update_peer_info once for all changed states.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
None of the subscribers does anything in before_change.
This is done before changing `on_change` in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When adding a peer via update_peer_info,
insert all columns in a single query
using system_keyspace::peer_info.
This ensures that `host_id` is inserted along with all
other app states, so we can rely on it
when loading the peer info after restart.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use the newly added system_keyspace::peer_info
to pass a struct of all optional system.peers members
to system_keyspace::update_peer_info.
Add `get_peer_info_for_update` to construct said struct
from the endpoint state.
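For illustration, a sketch of what such a struct of optional columns could look like; the member list is a guess at typical system.peers columns, not the exact schema:

```cpp
#include <optional>
#include <string>

// Hypothetical shape of a peer_info-style struct: only the engaged members
// end up in the single INSERT issued by update_peer_info.
struct peer_info {
    std::optional<std::string> data_center;
    std::optional<std::string> rack;
    std::optional<std::string> release_version;
    std::optional<std::string> host_id;
    std::optional<std::string> rpc_address;
    std::optional<std::string> tokens;
};
```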
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Have a central definition for the map held
in the endpoint_state (before changing it to
std::unordered_map).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`topology_state_load` currently calls `add_saved_endpoint`
only if it finds no endpoint_state_ptr for the endpoint.
However, this is done before locking the endpoint
and the endpoint state could be inserted concurrently.
To prevent that, a permit_id parameter was added to
`add_saved_endpoint` allowing the caller to call it
while the endpoint is locked. With that, `topology_state_load`
locks the endpoint and checks the existence of the endpoint state
under the lock, before calling `add_saved_endpoint`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the `raft_group_registry::on_alive` error injection point
to `gossiper::real_mark_alive` so it can delay marking the endpoint as
alive, and calling the `on_alive` callback, but without holding
the endpoint_lock.
Note that the entry for this endpoint in `_pending_mark_alive_endpoints`
still blocks marking it as alive until real_mark_alive completes.
Fixes #16506
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>