scylladb

Author	SHA1	Message	Date
Benny Halevy	4bd0845fce	gossiper: make send_gossip_echo cancellable Currently send_gossip_echo has a 22 seconds timeout during which _abort_source is ignored. Mark the verb as cancellable so it can be canceled on shutdown / abort. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:46:10 +03:00
Benny Halevy	fa1c3e86a9	gossiper: add send_echo helper Call send_gossip_echo using a centralized helper. A following patch will make it abortable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:45:51 +03:00
Benny Halevy	e06d226d08	gossiper: failure_detector_loop_for_node: ignore abort_requested_exception Aborting the failure detector happens normally when the node shuts down. There's no need to log anything about it, as long as we abort the function cleanly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Benny Halevy	83c69642f7	gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition The same as the loop condition in the direct_failure_detector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Benny Halevy	cecfb6dfd7	gms: gossiper: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
Aleksandra Martyniuk	b80e957a40	gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-08 10:42:01 +02:00
Avi Kivity	8d2a41db82	Merge "Fixes for gossiper conversion to host id" from Gleb " The series contains fixes to gossiper conversion to host id. There are two fixes where we could erroneously send outdated entry in a gossiper message and a fix for force_remove_endpoint which was not converted to work on host id and this caused it to not delete the entry in some cases (in replace with the same ip case). " * 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev: gossiper: send newest entry in a digest message gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter gossiper: move force_remove_endpoint to work on host id gossiper: do not send outdated endpoint in gossiper round	2025-04-07 17:04:28 +03:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
Gleb Natapov	a982db326e	gossiper: send newest entry in a digest message In cases where two entries have the same ip address, send information only for the newest one. Currently we send both, which makes the receiver use one of them at random, and it may be the outdated one (though it should only cause more data than needed to be requested).	2025-04-06 18:39:24 +03:00
Gleb Natapov	8d534ee68e	gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter	2025-04-06 18:39:24 +03:00
Gleb Natapov	6f53611337	gossiper: move force_remove_endpoint to work on host id Since the gossiper works on host ids now, it is incorrect to leave this function to work on ip. It makes it impossible to delete an outdated entry since the "gossiper.get_host_id(endpoint) != id" check will always be false for such entries (get_host_id() always returns the most up-to-date mapping).	2025-04-06 18:39:24 +03:00
Gleb Natapov	df6cd87bcc	gossiper: do not send outdated endpoint in gossiper round Now that the gossiper map is id based there can be a situation where two entries have the same ip. Shadow round should send the newest one in this case. The patch makes it so. Fixes: #23553	2025-04-06 15:08:03 +03:00
Botond Dénes	fcdae20fd1	Merge 'Add tablet enforcing option' from Benny Halevy This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing `enable_tablets` option. It can be set to the following values: disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option. enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option. enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option. `tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether tablets are disabled or enabled by default for new keyspaces, respectively. In either case, tablets can be opted in or out using the `tablets={'enabled':...}` keyspace option when the keyspace is created. `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow opting out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 * Requires backport to 2025.1 Closes scylladb/scylladb#22273 * github.com:scylladb/scylladb: boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option db/config: add tablets_mode_for_new_keyspaces option	2025-04-03 16:32:19 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implication outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. On wire protocol is not affected by the change as maps that are sent by the gossiper protocol remain ip based. If old node sends two different entries for the same host id the one with newer generation is applied. If new node has two ids that are mapped to the same ip the newer one is added to the outgoing map. Interoperability was verified manually by running mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch 
gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
Michał Chojnowski	61316e29df	gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature This feature will guard against writing SSTables containing compression dictionaries before the entire cluster is able to understand them.	2025-04-01 00:07:28 +02:00
Gleb Natapov	3abe5de8bf	gossiper: make examine_gossiper private	2025-03-31 16:50:50 +03:00
Gleb Natapov	afdfde8300	gossiper: rename get_nodes_with_host_id to get_node_ip Also change it to return std::optional instead of std::set since now there can be only one ip mapped to an id.	2025-03-31 16:50:50 +03:00
Gleb Natapov	28fb84117d	treewide: drop id parameter from gossiper::for_each_endpoint_state We have it in endpoint_state anyway, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Gleb Natapov	19ac05b0ba	gossiper: drop ip from replicate function parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	c5b8429bec	gossiper: drop ip from apply_new_states parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	6da5f541a2	gossiper: drop address from handle_major_state_change parameter list We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	5e06bf76e0	gossiper: pass rpc::client_info to gossiper_shutdown verb handler It will be needed later to obtain host id of the peer.	2025-03-31 16:50:50 +03:00
Gleb Natapov	704580b197	gossiper: add try_get_host_id function The function returns unengaged std::optional if id is not found instead of throwing like get_host_id does.	2025-03-31 16:50:45 +03:00
Gleb Natapov	6999b474a1	gossiper: add ip to endpoint_state Store endpoint's IP in the endpoint state. Currently it is stored as a key in gossiper's endpoint map, but we are going to change that. The new field is not serialized when endpoint state is sent over rpc, so it is set by the rpc handler from the value in the map that is in the rpc message. This map will not be changed to be host id based to not break interoperability.	2025-03-31 15:42:08 +03:00
Gleb Natapov	e5cc3b75f8	gossiper: drop template from wait_alive_helper function Move ip to id translation to the caller.	2025-03-31 15:42:07 +03:00
Gleb Natapov	0dd86b4f1d	gossiper: move get_supported_features and its users to host id	2025-03-31 15:42:07 +03:00
Gleb Natapov	a581a99dbf	gossiper: move _pending_mark_alive_endpoints to host id Index _pending_mark_alive_endpoints map by host id instead of ip	2025-03-31 15:25:39 +03:00
Gleb Natapov	555149c153	gossiper: do not allow to assassinate endpoint in raft topology mode It does nothing but harm in raft topology mode.	2025-03-31 15:25:39 +03:00
Gleb Natapov	4cc1c10035	gossiper: fix indentation after previous patch	2025-03-31 15:25:39 +03:00
Gleb Natapov	e8b7aaa0d4	gossiper: do not allow to assassinate non existing endpoint We assume that all endpoint states have HOST_ID set or the host id is available locally, but the assassinate code injects a state without HOST_ID for a non-existing endpoint, violating this assumption.	2025-03-31 15:25:39 +03:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" which will enable tablets by default for new keyspaces but without the possibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new nodes can co-exist in the same cluster. " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused 
template ...	2025-03-13 13:36:31 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
Gleb Natapov	57f2b6d825	gossiper: drop unneeded code host_id is already available at this point.	2025-03-11 12:09:22 +02:00
Gleb Natapov	cca228265e	gossiper: move _expire_time_endpoint_map to host_id Index _expire_time_endpoint_map map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	c45b50bbe6	gossiper: move _just_removed_endpoints to host id Index _just_removed_endpoints map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	22739bb39a	gossiper: drop unused get_msg_addr function	2025-03-11 12:09:22 +02:00
Gleb Natapov	499eb4d17f	treewide: pass host id to endpoint state change subscribers	2025-03-11 12:09:22 +02:00
Gleb Natapov	eb59205caf	gossiper: drop deprecated unsafe_assassinate_endpoint operation It was always deprecated.	2025-03-11 12:09:21 +02:00
Gleb Natapov	696aee3adc	treewide: drop endpoint state change subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:21 +02:00
Gleb Natapov	7dcffda6bd	gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory	2025-03-11 12:09:21 +02:00
Gleb Natapov	8425c26462	gossiper: start using host ids to send messages earlier Send digest ack and ack2 by host ids as well now since the id->ip mapping is available after receiving digest syn. It allows to convert more code to host id here.	2025-03-11 12:09:21 +02:00
Gleb Natapov	0e3dcb7954	treewide: move everyone to use host id based gossiper::is_alive and drop ip based one	2025-03-11 12:09:21 +02:00
Gleb Natapov	e47f251178	gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id Index live and dead endpoints by host id. It also allows to simplify some code that does a translation.	2025-03-11 12:09:21 +02:00
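
The majority arithmetic quoted in the limited voters commits above (07a7a75b98, 1d06ea3a5a) can be checked with a small sketch. This is only an illustration of strict-majority quorum math, not ScyllaDB code: the function names `quorum_size`, `tolerated_failures` and `has_majority_after_loss` are hypothetical and do not appear in the repository.

```cpp
#include <cstddef>

// Hypothetical helpers illustrating strict-majority quorum arithmetic.
// They are not part of ScyllaDB; the names are made up for this sketch.

// Smallest number of voters that forms a strict majority of `voters`.
constexpr std::size_t quorum_size(std::size_t voters) {
    return voters / 2 + 1;
}

// Largest number of voters that can fail while a majority still survives.
constexpr std::size_t tolerated_failures(std::size_t voters) {
    return voters == 0 ? 0 : (voters - 1) / 2;
}

// True if a majority remains after `lost` of the `voters` become unavailable.
constexpr bool has_majority_after_loss(std::size_t voters, std::size_t lost) {
    return lost <= voters && (voters - lost) >= quorum_size(voters);
}

// With fewer than 5 voters, losing 2 voters loses the majority.
static_assert(tolerated_failures(3) == 1);
static_assert(tolerated_failures(4) == 1);
// With 5 voters, at least 3 must fail before the majority is lost.
static_assert(tolerated_failures(5) == 2);

// Two DCs with 2 nodes each, voters split 2 + 1: losing the 1-voter DC is survivable.
static_assert(has_majority_after_loss(3, 1));
// A symmetric 2 + 2 split would lose the majority if either DC is lost.
static_assert(!has_majority_after_loss(4, 2));
// All nodes voting, 2 DCs x 5 nodes plus 1 DC x 10 nodes: isolating the large DC loses the majority.
static_assert(!has_majority_after_loss(20, 10));
```

The checks are written as `static_assert` so the arithmetic is verified at compile time (C++17 or later); no runtime harness is needed.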

1319 Commits