scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Author	SHA1	Message	Date
Asias He	d3efb3ab6f	storage_service: Set session id for raft_rebuild Raft rebuild is broken because the session id is not set. The following was seen when run rebuild stream_session - [Stream #8cfca940-afc9-11ee-b6f1-30b8f78c1451] stream_transfer_task: Fail to send to 127.0.70.1:0: seastar::rpc::remote_verb_error (Session not found: 00000000-0000-0000-0000-000000000000) with raft topology, e.g., scylla --enable-repair-based-node-ops 0 --consistent-cluster-management true --experimental-features consistent-topology-changes Fix by setting the session id. Fixes #16741 Closes scylladb/scylladb#16814	2024-01-18 12:47:20 +02:00
Kefu Chai	0ae81446ef	./: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16766	2024-01-17 16:30:14 +02:00
Kamil Braun	787b24cd24	Merge 'raft topology: join: shut down a node on error in response handler' from Patryk Jędrzejczak If the joining node fails while handling the response from the topology coordinator, it hangs even though it knows the join operation has failed. Therefore, we ensure it shuts down in this patch. Additionally, we ensure that if the first join request response was a rejection or the node failed while handling it, the following acceptances by the (possibly different) coordinator don't succeed. The node considers the join operation as failed. We shouldn't add it to the cluster. Fixes scylladb/scylladb#16333 Closes scylladb/scylladb#16650 * github.com:scylladb/scylladb: topology_coordinator: clarify warnings raft topology: join: allow only the first response to be a successful acceptance storage_service: join_node_response_handler: fix indentation raft topology: join: shut down a node on error in response handler	2024-01-17 14:55:26 +01:00
Botond Dénes	f22fc88a64	Merge 'Configure service levels interval' from Michał Jadwiszczak Service level controller updates itself at a fixed interval. However, the interval is hardcoded in main to 10 seconds, which leads to long sleeps in some of the tests. This patch moves this value to the `service_levels_interval_ms` command line option and sets it to 0.5s in cql-pytest. Closes scylladb/scylladb#16394 * github.com:scylladb/scylladb: test:cql-pytest: change service levels intervals in tests configure service levels interval	2024-01-17 12:24:49 +02:00
Tomasz Grabiec	3d76aefb98	Merge "Enhance topology request status tracking" from Gleb Currently, to figure out if a topology request is complete, a submitter checks the topology state and tries to infer the status of the request from it. This is not exact. Let's look at rebuild handling, for instance. To figure out if a request is completed, the code waits for the request object to disappear from the topology, but if another rebuild starts between the end of the previous one and the code noticing that it completed, the code will continue waiting for the next rebuild. Another problem is that in case of operation failure there is no way to pass an error back to the initiator. This series solves those problems by assigning an id to each request and tracking the status of each request in a separate table. The initiator can query the request status from the table and see if the request was completed successfully or if it failed with an error, which is also available from the table. The schema for the table is: CREATE TABLE system.topology_requests ( id timeuuid PRIMARY KEY, initiating_host uuid, start_time timestamp, done boolean, error text, end_time timestamp, ); and all entries have a TTL of one month.	2024-01-17 00:37:19 +01:00
Gleb Natapov	9a7243d71a	storage_service: topology coordinator: Consolidate some mutation builder code	2024-01-16 17:02:54 +02:00
Gleb Natapov	a145a73136	storage_service: topology coordinator: make topology operation rollback error more informative Include an error which caused the rollback.	2024-01-16 17:02:54 +02:00
Gleb Natapov	bf91eb37f2	storage_service: topology coordinator: make topology operation cancellation error more informative Include the list of nodes that were down when cancellation happened.	2024-01-16 17:02:54 +02:00
Gleb Natapov	8beb399b72	storage_service: topology coordinator: consolidate some code in cancel_all_requests There is a code duplication that can be avoided.	2024-01-16 17:02:54 +02:00
Gleb Natapov	fba6877b3e	storage_service: topology coordinator: TTL topology request table To prevent topology_request table growth TTL all writes to expire after a month.	2024-01-16 17:02:54 +02:00
Gleb Natapov	d576ed31dc	storage_service: topology request: drop explicit shutdown rpc Now that we have explicit status for each request we may use it to replace shutdown notification rpc. During a decommission, in left_token_ring state, we set done to true after the metadata barrier that waits for all requests to the decommissioning node to complete, and notify the decommissioning node with a regular barrier. At this point the node will see that the request is complete and exit.	2024-01-16 17:02:54 +02:00
Gleb Natapov	84197ff735	storage_service: topology coordinator: check topology operation completion using status in topology_requests table Instead of trying to guess if a request completed by looking into the topology state (which can sometimes be error prone), look at the request status in the new topology_requests table. If the request failed, report the reason for the failure from the table.	2024-01-16 17:02:54 +02:00
Kefu Chai	2dbf044b91	cql3: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16791	2024-01-16 16:43:17 +02:00
Gleb Natapov	1c18476385	storage_service: topology coordinator: update topology_requests table with requests progress Make topology coordinator update request's status in topology_requests table as it changes.	2024-01-16 15:35:18 +02:00
Gleb Natapov	1ce1c5001d	topology coordinator: add topology_requests table to group0 snapshot Since the table is updated through raft's group0 state machine its content needs to be part of the snapshot.	2024-01-16 13:57:27 +02:00
Gleb Natapov	584551f849	topology coordinator: add request_id to the topology state machine Provide a unique ID for each topology request and store it in the topology state machine. It will be used to index the new topology requests table in order to retrieve request status.	2024-01-16 13:57:27 +02:00
Avi Kivity	5e70dd1dbe	database: don't allow keyspace objects to be copied keyspace objects are heavyweight and copies are immediately out-of-date, so copying them is bad. Fix by deleting the copy constructor and copy assignment operator. One call site is fixed. This call site is safe since it's only used for accessing a few attributes (introduced in `f70c4127c6`). Closes scylladb/scylladb#16782	2024-01-15 21:48:32 +01:00
Kefu Chai	e5300f3e21	topology_state_machine: add formatter for service::cleanup_status before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define a formatter for service::cleanup_status, and remove its operator<<(). Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16778	2024-01-15 21:31:42 +02:00
Kefu Chai	ece2bd2f6e	service: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16764	2024-01-15 13:29:33 +02:00
Gleb Natapov	ba7aa0d582	storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator	2024-01-14 15:45:53 +02:00
Gleb Natapov	1afc891bd5	storage_service: topology coordinator: add logging to removenode and decommission Add some useful logging to removenode and decommission to be used by tests later.	2024-01-14 15:45:53 +02:00
Gleb Natapov	97ab3f6622	storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator Introduce new REST API "/storage_service/cleanup_all" that, when triggered, instructs the topology coordinator to initiate cluster wide cleanup on all dirty nodes. It is done by introducing new global command "global_topology_request::cleanup".	2024-01-14 15:45:53 +02:00
Gleb Natapov	0adb3904d8	storage_service: topology coordinator: manage cluster cleanup as part of the topology management Sometimes it is unsafe to start a new topology operation before cleanup runs on dirty nodes. This patch detects the situation when the topology operation to be executed cannot be run safely until all dirty nodes do cleanup and initiates the cleanup automatically. It also waits for cleanup to complete before proceeding with the topology operation. There can be a situation where a node that needs cleanup dies and will never clear the flag. In this case, if a topology operation that wants to run next does not have this node in its ignore node list, it may get stuck forever. To fix this the patch also introduces "liveness aware" request queue management: we do not simply choose _a_ request to run next, but go over the queue and find requests that can proceed considering the node liveness situation. If there are multiple requests eligible to run, the patch introduces an order based on the operation type: replace, join, remove, leave, rebuild. The order is chosen so as not to trigger cleanup needlessly.	2024-01-14 15:45:50 +02:00
Gleb Natapov	c9b7bd5a33	storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter Needed by the next patch.	2024-01-14 14:44:07 +02:00
Gleb Natapov	067267ff76	storage_service: topology coordinator: make topology coordinator lifecycle subscriber We want to change the coordinator to consider node liveness when processing the topology operation queue. If there are not enough live nodes to process any of the ops we want to cancel them. For that to work we need to be able to kick the coordinator if the liveness situation changes.	2024-01-14 14:44:07 +02:00
Gleb Natapov	f70c4127c6	storage_service: topology coordinator: introduce sstable cleanup fiber Introduce a fiber that waits on a topology event and when it sees that the node it runs on needs to perform sstable cleanup it initiates one for each non tablet, non local table and resets "cleanup" flag back to "clean" in the topology.	2024-01-14 14:44:07 +02:00
Gleb Natapov	5b246920ae	storage_proxy: allow to wait for all ongoing writes We want to be able to wait for all writes started through the storage proxy before a fence is advanced. Add phased_barrier that is entered on each local write operation before checking the fence to do so. A write will be either tracked by the phased_barrier or fenced. This will be needed to wait for all non fenced local writes to complete before starting a cleanup.	2024-01-14 14:44:07 +02:00
Gleb Natapov	b2ba77978c	storage_service: topology coordinator: mark nodes as needing cleanup when required A cleanup needs to run when a node loses ownership of a range (during bootstrap) or if a range movement to a normal node failed (removenode, decommission failure). Mark all dirty nodes as "cleanup needed" in those cases.	2024-01-14 14:43:59 +02:00
Gleb Natapov	dbededb1a6	storage_service: add mark_nodes_as_cleanup_needed function The function creates a mutation that sets cleanup to "needed" for each normal node that, according to the erm, has data it does not own after successful or unsuccessful topology operation.	2024-01-14 14:43:33 +02:00
Gleb Natapov	cc54796e23	raft topology: add cleanup state to the topology state machine The patch adds cleanup state to the persistent and in memory state and handles the loading. The state can be "clean", which means no cleanup is needed; "needed", which means the node is dirty and needs to run cleanup at some point; or "running", which means that the node is running cleanup right now and, when it completes, the state will be reset to "clean".	2024-01-14 13:30:54 +02:00
Kamil Braun	4e18f8b453	Merge 'topology_state_load: stop waiting for IP-s' from Petr Gusev The loop in `id2ip` lambda causes problems if we are applying an old raft log that contains long-gone nodes. In this case, we may never receive the `IP` for a node and get stuck in the loop forever. In this series we replace the loop with an if - we just don't update the `host_id <-> ip` mapping in the `token_metadata.topology` if we don't have an `IP` yet. The PR moves `host_id -> IP` resolution to the data plane, now it happens each time the IP-based methods of `erm` are called. We need this because IPs may not be known at the time the erm is built. The overhead of `raft_address_map` lookup is added to each data plane request, but it should be negligible. In this PR `erm/resolve_endpoints` continues to treat missing IP for `host_id` as `internal_error`, but we plan to relax this in the follow-up (see this PR first comment). Closes scylladb/scylladb#16639 * github.com:scylladb/scylladb: raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes storage_service: topology_state_load: remove IP waiting loop storage_service: sync_raft_topology_nodes: add target_node parameter storage_service: sync_raft_topology_nodes: move loops to the end storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node storage_service: sync_raft_topology_nodes: move update_topology up storage_service: topology_state_load: remove clone_async/clear_gently overhead storage_service: fix indentation storage_service: extract sync_raft_topology_nodes storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata address_map: move gossiper subscription logic into storage_service topology_coordinator: exec_global_command: small refactor, use contains + reformat storage_service: wait_for_ip for new nodes storage_service.idl.hh: fix raft_topology_cmd.command declaration erm: for_each_natural_endpoint_until: use is_vnode == true erm: switch the internal data structures to host_id-s erm: has_pending_ranges: switch to host_id	2024-01-12 18:46:51 +01:00
Petr Gusev	e24bee545b	raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater	2024-01-12 18:29:22 +04:00
Petr Gusev	6e7bbc94f4	gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes When a node changes its IP we need to store the mapping in system.peers and update token_metadata.topology and erm in-memory data structures. The test_change_ip was improved to verify this new behaviour. Before this patch the test didn't check that IPs used for data requests are updated on IP change. In this commit we add the read/write check. It fails on insert with 'node unavailable' error without the fix.	2024-01-12 18:28:57 +04:00
Petr Gusev	6d6e1ba8fb	storage_service: topology_state_load: remove IP waiting loop The loop causes problems if we are applying an old raft log that contains long-gone nodes. In this case, we may never receive the IP for a node and get stuck in the loop forever. The idea of the patch is to replace the loop with an if - we just don't update the host_id <-> ip mapping in the token_metadata.topology if we don't have an IP yet. When we get the mapping later, we'll call sync_raft_topology_nodes again from gossiper_state_change_subscriber_proxy.	2024-01-12 15:37:50 +04:00
Petr Gusev	260874c860	storage_service: sync_raft_topology_nodes: add target_node parameter If it's set, instead of going over all the nodes in raft topology, the function will update only the specified node. This parameter will be used in the next commit, in the call to sync_raft_topology_nodes from gossiper_state_change_subscriber_proxy.	2024-01-12 15:37:50 +04:00
Petr Gusev	a9d58c3db5	storage_service: sync_raft_topology_nodes: move loops to the end	2024-01-12 15:37:50 +04:00
Petr Gusev	d1bce3651b	storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node	2024-01-12 15:37:50 +04:00
Petr Gusev	aa37b6cfd3	storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node	2024-01-12 15:37:50 +04:00
Petr Gusev	a508d7ffc5	storage_service: sync_raft_topology_nodes: move update_topology up In this and the following commits we prepare sync_raft_topology_nodes to handle target_node parameter - the single host_id which should be updated.	2024-01-12 15:37:50 +04:00
Petr Gusev	1b12f4b292	storage_service: topology_state_load: remove clone_async/clear_gently overhead Before the patch we used to clone the entire token_metadata and topology only to immediately drop everything in clear_gently. This is a sheer waste.	2024-01-12 15:37:50 +04:00
Petr Gusev	1531e5e063	storage_service: fix indentation	2024-01-12 15:37:50 +04:00
Petr Gusev	9c50637f28	storage_service: extract sync_raft_topology_nodes In the following commits we need part of the topology_state_load logic to be applied from gossiper_state_change_subscriber_proxy. In this commit we extract this logic into a new function sync_raft_topology_nodes.	2024-01-12 15:37:50 +04:00
Petr Gusev	9679b49cf4	storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata In the next commit we extract the loops by nodes into a new function, in this commit we just move them closer to each other. Now the remove_endpoint function might be called under token_metadata_lock (mutate_token_metadata takes it). It's not a problem since gossiper event handlers in raft_topology mode don't modify token_metadata so we won't get a deadlock.	2024-01-12 15:37:50 +04:00
Petr Gusev	15b8e565ed	address_map: move gossiper subscription logic into storage_service We are going to remove the IP waiting loop from topology_state_load in subsequent commits. An IP for a given host_id may change after this function has been called by raft. This means we need to subscribe to the gossiper notifications and call it later with a new id<->ip mapping. In this preparatory commit we move the existing address_map update logic into storage_service so that in later commits we can enhance it with topology_state_load call.	2024-01-12 15:37:50 +04:00
Petr Gusev	743be190f9	topology_coordinator: exec_global_command: small refactor, use contains + reformat	2024-01-12 15:37:50 +04:00
Petr Gusev	db1f0d5889	storage_service: wait_for_ip for new nodes When a new node joins the cluster we need to be sure that its IP is known to all other nodes. In this patch we do this by waiting for the IP to appear in raft_address_map. A new raft_topology_cmd::command::wait_for_ip command is added. It's run on all nodes of the cluster before we put the topology into transition state. This applies both to new and replacing nodes. It's important to run wait_for_ip before moving to topology::transition_state::join_group0 since in this state node IPs are already used to populate pending nodes in erm.	2024-01-12 15:37:46 +04:00
Michał Jadwiszczak	f6a464ad81	configure service levels interval So far the service levels interval, responsible for updating SL configuration, was hardcoded in main. Now it's extracted to `service_levels_interval_ms` option.	2024-01-12 10:28:24 +01:00
Botond Dénes	5f44ae8371	Merge 'Add more logging for `gossiper::lock_endpoint` and `storage_service::handle_state_normal`' from Kamil Braun In a longevity test reported in scylladb/scylladb#16668 we observed that NORMAL state is not being properly handled for a node that replaced another node. Either handle_state_normal is not being called, or it is but gets stuck in the middle. Which of these is the case couldn't be determined from the logs, and attempts at creating a local reproducer failed. Thus the plan is to continue debugging using the longevity test, but we need more logs. To check whether `handle_state_normal` was called and which branches were taken, include some INFO level logs there. Also, detect deadlocks inside `gossiper::lock_endpoint` by reporting an error message if `lock_endpoint` waits for the lock for too long. Ref: scylladb/scylladb#16668 Closes scylladb/scylladb#16733 * github.com:scylladb/scylladb: gossiper: report error when waiting too long for endpoint lock gossiper: store source_location instead of string in endpoint_permit storage_service: more verbose logging in handle_state_normal	2024-01-12 10:51:21 +02:00
Petr Gusev	1928dc73a8	erm: has_pending_ranges: switch to host_id In the next patches we are going to change erm data structures (replication_map and ring_mapping) from IP to host_id. Having locator::host_id instead of IP in has_pending_ranges arguments makes this transition easier.	2024-01-12 12:23:19 +04:00
Benny Halevy	3e938dbb5a	storage_service: get rid of handle_state_moving declaration The implementation was already removed in `e64613154f` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#16742	2024-01-12 09:38:23 +02:00


4103 Commits