scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has a tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00
Botond Dénes	8ac7c54d8b	Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 * The issue exists since 6.2 Closes scylladb/scylladb#17712 * github.com:scylladb/scylladb: topology_coordinator: stop: await all background_action_holder:s topology_coordinator: stop: improve error messages topology_coordinator: stop: define stop_background_action helper	2025-04-17 12:10:29 +03:00
Kefu Chai	a33651b03e	db, service: do not include unused header these unused headers were flagged by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23735	2025-04-17 11:49:59 +03:00
Benny Halevy	7a0f5e0a54	topology_coordinator: stop: await all background_action_holder:s Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:23:02 +03:00
Benny Halevy	6de79d0dd3	topology_coordinator: stop: improve error messages "when cleanup" is ill-formed. Change "when XYZ" to "during XYZ" instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:58 +03:00
Benny Halevy	d624795fda	topology_coordinator: stop: define stop_background_action helper Refactor the code to use a helper to await background_action_holder and handle any errors by printing a warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:39 +03:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Tomasz Grabiec	001d3b2415	Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. Closes scylladb/scylladb#23751 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-16 00:19:24 +02:00
Pavel Emelyanov	b79137eaa4	storage_service: Use this->_features directly This dependency is already there, storage service doesn't need to go rounds via database reference to get to the features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23739	2025-04-15 21:11:12 +03:00
Laszlo Ersek	841ca652a0	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 15:15:25 +02:00
Laszlo Ersek	e1186f0ae6	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 13:44:45 +02:00
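The state-change rule in the commit above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `transition_state`, `topology_state`, and both helper functions are hypothetical names, and an empty `std::optional` is assumed to stand for an idle topology state machine. It is not the code in `storage_service`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical model of the topology transition state. The enumerators mirror
// the state names quoted in the commit message, not the real Scylla types.
enum class transition_state { tablet_draining, tablet_migration };

// Assumption: an empty optional represents an idle topology state machine.
using topology_state = std::optional<transition_state>;

// Behaviour described as pre-fix: always switch to tablet_migration,
// which clobbers a tablet_draining state that was already in progress.
inline topology_state start_migration_unconditional(topology_state) {
    return transition_state::tablet_migration;
}

// Behaviour described as post-fix: change the state only when idle,
// and leave any busy state (draining or migrating) untouched.
inline topology_state start_migration_if_idle(topology_state current) {
    if (!current) {
        return transition_state::tablet_migration;
    }
    return current;
}
```

With this model, starting a migration while in `tablet_draining` keeps `tablet_draining` under the guarded variant, whereas the unconditional variant overwrites it, which is the situation the unit test in the commit series exercises.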
Emil Maskovsky	3930ee8e3c	raft: fix data center remaining nodes initialization The `_remaining_nodes` attribute of the data center information was not initialized correctly. The parameter was passed by value to the initialization function instead of by reference or pointer. As a result, `_remaining_nodes` was left initialized to zero, causing an underflow when decrementing its value. This bug did not significantly impact behavior because other safeguards, such as capping the maximum voters per data center by the total number of nodes, masked the issue. However, it could lead to inefficiencies, as the remaining nodes check would not trigger correctly. Fixes: scylladb/scylladb#23702 No backport: The bug is only present in the master branch, so no backport is required. Closes scylladb/scylladb#23704	2025-04-15 09:58:32 +02:00
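The class of bug described in the commit above (a counter passed by value to its initializer) can be reproduced in isolation. The sketch below uses hypothetical names (`datacenter_info`, `init_remaining_by_value`, `init_remaining_by_ref`) and assumes the counter is an unsigned integer, which is what makes the decrement wrap around; the actual Scylla code may differ in both respects.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the per-datacenter bookkeeping.
struct datacenter_info {
    std::size_t remaining_nodes = 0;  // assumed unsigned, starts at zero
};

// Buggy shape: the counter arrives by value, so the assignment only
// changes the local copy and the caller's field stays at zero.
inline void init_remaining_by_value(std::size_t remaining, std::size_t node_count) {
    remaining = node_count;
    (void)remaining;  // silence unused-value warnings
}

// Fixed shape: the counter arrives by reference, so the caller's field is set.
inline void init_remaining_by_ref(std::size_t& remaining, std::size_t node_count) {
    remaining = node_count;
}
```

After `init_remaining_by_value(dc.remaining_nodes, 3)` the field is still 0, and decrementing it wraps to the largest `std::size_t` value. After `init_remaining_by_ref` it holds 3 and decrements as expected.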
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently, making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e., 60 seconds), the same as the existing default. The test framework overrides this default with a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configuration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows, maybe in the future we might need to tweak it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	5e864b6277	topology_coordinator: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:46 +03:00
Benny Halevy	a67ed59399	storage_service: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	39f1175451	storage_proxy: wait_for_hint_sync_point: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e228a112fe	storage_proxy: remote: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	0a1e7de6ea	service: session: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	747446cb25	service: raft: raft_rpc: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	01bb3980fc	service: raft: raft_group0: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	6118150d44	service: raft: persistent_discovery: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e430df6332	service: raft: group0_state_machine: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	5f8b5724e6	service: migration_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
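The majority arithmetic behind the limited voters description above can be checked with a small C++ sketch. It does not implement the voter selection itself. It only encodes the two properties the message relies on: how many voters must fail before a majority is lost, and whether a given per-DC voter distribution survives the loss of one DC. All function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Smallest number of simultaneous voter failures that loses the majority.
// A majority is voters / 2 + 1, so it is lost once voters - voters / 2 fail.
inline std::size_t failures_to_lose_majority(std::size_t voters) {
    return voters - voters / 2;
}

// True if the voters left after losing one whole DC still form a strict
// majority of the original voter set.
inline bool majority_survives_loss_of(const std::vector<std::size_t>& voters_per_dc,
                                      std::size_t lost_dc) {
    const std::size_t total =
        std::accumulate(voters_per_dc.begin(), voters_per_dc.end(), std::size_t{0});
    const std::size_t remaining = total - voters_per_dc[lost_dc];
    return 2 * remaining > total;
}

// True if no single DC loss can break the majority, i.e. every DC holds
// strictly less than half of the voters (the "3 and more DCs" rule).
inline bool tolerates_any_single_dc_loss(const std::vector<std::size_t>& voters_per_dc) {
    for (std::size_t dc = 0; dc < voters_per_dc.size(); ++dc) {
        if (!majority_survives_loss_of(voters_per_dc, dc)) {
            return false;
        }
    }
    return true;
}
```

For example, `failures_to_lose_majority(5)` is 3 while 3 or 4 voters give 2, matching the rationale for the limit of 5. A two-DC split of `{2, 1}` survives losing the smaller DC but not the larger one, and an even `{2, 2}` split survives neither, which is why the two-DC distribution is kept asymmetric.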
Avi Kivity	9559e53f55	Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph. To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer. Closes scylladb/scylladb#23584 * github.com:scylladb/scylladb: tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization tablet-mon.py: Center tablet id text properly in the vertical axis tablet-mon.py: Show migration stage tag in table mode only when migrating virtual-tables: Introduce system.load_per_node virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() docs: virtual-tables: Fix instructions service: tablets: Keep load_stats inside tablet_allocator	2025-04-10 14:59:08 +03:00
Pavel Emelyanov	88318d3b50	topology_coordinator: Use shorter fault-injection overloads There are a few places that want to pause until a message is received from the test. There's a convenient one-line sugar to do it. One test needs to update its expectations about the log message that appears when scylla steps on it and actually starts waiting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23390	2025-04-10 14:05:46 +03:00
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Tomasz Grabiec	76bc11c78c	service: tablets: Keep load_stats inside tablet_allocator So that virtual tables can pick them up. It's a better place to keep them than in topology_coordinator.	2025-04-09 20:21:51 +02:00
Robert Bindar	4e3eb2fdac	Move direct_failure_detector from root to service/ direct_failure_detector used to be used by gms/ as well, but that's not the case anymore, so raft/ is the only user. Fixes #23133 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23248	2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk	372b562f5e	test: add test for rebuild with repair	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	eb17af6143	locator: service: add transition to rebuild_repair stage for rebuild_v2 Modify write_both_read_old and streaming stages in rebuild_v2 transition kind: write_both_read_old moves to rebuild_repair stage and streaming stage streams data only from one replica.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streaming_info::read_from is the repair master. If the repair succeeds, we move to the streaming tablet transition stage, and to cleanup_target if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
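The stage ordering described for the rebuild_v2 transition kind can be summarized as a tiny state function. This is a sketch only: the `stage` enum and `next_stage_rebuild_v2` are hypothetical names, only the stages named in the commit messages are modelled, and later stages are returned unchanged.

```cpp
#include <cassert>

// Hypothetical subset of tablet transition stages, named after the stages
// mentioned in the rebuild_v2 commit messages.
enum class stage { write_both_read_old, rebuild_repair, streaming, cleanup_target };

// Next stage for a rebuild_v2 transition, as described in the series:
//   write_both_read_old -> rebuild_repair
//   rebuild_repair      -> streaming       (repair succeeded)
//   rebuild_repair      -> cleanup_target  (repair failed)
// Stages that the commit messages do not describe are left as they are.
inline stage next_stage_rebuild_v2(stage current, bool repair_succeeded) {
    switch (current) {
    case stage::write_both_read_old:
        return stage::rebuild_repair;
    case stage::rebuild_repair:
        return repair_succeeded ? stage::streaming : stage::cleanup_target;
    default:
        return current;
    }
}
```

The `repair_succeeded` flag only matters in the `rebuild_repair` stage. In this model the streaming stage that follows reads from a single replica, as the series describes.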
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transitions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Avi Kivity	8d2a41db82	Merge "Fixes for gossiper conversion to host id" from Gleb " The series contains fixes to gossiper conversion to host id. There are two fixes where we could erroneously send outdated entry in a gossiper message and a fix for force_remove_endpoint which was not converted to work on host id and this caused it to not delete the entry in some cases (in replace with the same ip case). " * 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev: gossiper: send newest entry in a digest message gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter gossiper: move force_remove_endpoint to work on host id gossiper: do not send outdated endpoint in gossiper round	2025-04-07 17:04:28 +03:00
Emil Maskovsky	76ceaf129b	raft: distribute voters by rack inside DC Distribute the voters evenly across racks in the datacenters. When distributing the voters across datacenters, the datacenters with more racks will be preferred in case of a tie. Also, in case of asymmetric voter distribution (2 DCs), the DC with more racks will have more voters (if the node counts allow it). In case of a single datacenter, the voters will be distributed across racks evenly (in the similar manner as done for the whole datacenters). The intention is that similar to losing a datacenter, we want to avoid losing the majority if a rack goes down - so if there are multiple racks, we want to distribute the voters across them in such a way that losing the whole rack will not cause the majority loss (if possible).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	a740623fa1	raft topology: handle on_up/on_down to add/remove node from voters Adding and removing the voters based on the node up/down events. This improves the availability of the system by automatically adjusting the number of voters in the system to use the alive nodes in precedence. We can then also drop the voter removal from the `write_both_read_old` to further simplify the code - the node will be removed from the voters when it goes down. However we only can do that in case the feature is enabled.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	dc6afd47b7	raft: fix the indentation after the limited voters changes Fix the indentation that needs to be changed because of the added condition. This is done separately to make it easier to review the main commit with the functional changes.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
Emil Maskovsky	8b186ab0ff	raft: drop the voter removal from the decommission In the particular case of node decommission, this code doesn't really matter in production and only confuses us. Losing majority is an extremely rare event, and for this code to help one would have to lose majority in a very specific way (exactly half of the nodes die in a short time window during decommission), which is unrealistic. In addition, this code will be completely irrelevant (and would never be executed) once we implement #23266. Refs: scylladb/scylladb#23266	2025-04-07 12:23:25 +02:00
Gleb Natapov	6f53611337	gossiper: move force_remove_endpoint to work on host id Since the gossiper works on host ids now it is incorrect to leave this function to work on ip. It makes it impossible to delete an outdated entry since the "gossiper.get_host_id(endpoint) != id" check will always be false for such entries (get_host_id() always returns the most up-to-date mapping).	2025-04-06 18:39:24 +03:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implications outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. The on-wire protocol is not affected by the change, as the maps that are sent by the gossiper protocol remain ip based. If an old node sends two different entries for the same host id, the one with the newer generation is applied. If a new node has two ids that are mapped to the same ip, the newer one is added to the outgoing map. Interoperability was verified manually by running a mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
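The two conflict rules described in the gossiper entry above (incoming ip-keyed maps collapsed by host id, outgoing host-id-keyed maps collapsed by ip) can be sketched as follows. `EndpointState`, `apply_incoming`, and `build_outgoing` are hypothetical names, and treating "newer" in the outgoing case as "higher generation" is an assumption, since the message does not say how it is decided.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EndpointState:
    host_id: str
    ip: str
    generation: int


def apply_incoming(wire_map: Dict[str, EndpointState]) -> Dict[str, EndpointState]:
    """Convert an ip-keyed map from the wire into a host-id-keyed map.

    If two entries carry the same host id, keep the newer generation.
    """
    by_host: Dict[str, EndpointState] = {}
    for state in wire_map.values():
        current = by_host.get(state.host_id)
        if current is None or state.generation > current.generation:
            by_host[state.host_id] = state
    return by_host


def build_outgoing(by_host: Dict[str, EndpointState]) -> Dict[str, EndpointState]:
    """Convert the local host-id-keyed map into an ip-keyed map for the wire.

    If two host ids map to the same ip, keep the newer one
    (assumed here to mean the higher generation).
    """
    by_ip: Dict[str, EndpointState] = {}
    for state in by_host.values():
        current = by_ip.get(state.ip)
        if current is None or state.generation > current.generation:
            by_ip[state.ip] = state
    return by_ip


# An old node reports host "h1" under two addresses.
incoming = apply_incoming({
    "10.0.0.1": EndpointState("h1", "10.0.0.1", generation=1),
    "10.0.0.2": EndpointState("h1", "10.0.0.2", generation=2),
})

# Locally, two host ids claim the same address.
outgoing = build_outgoing({
    "h1": EndpointState("h1", "10.0.0.5", generation=3),
    "h2": EndpointState("h2", "10.0.0.5", generation=7),
})
```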
Avi Kivity	69684e16d8	Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes: - We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards. - We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options). - We add a cluster feature which guards the creation of dictionary-compressed SSTables. - We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards. - We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries. - We add an HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that). - We add an HTTP API call which estimates the impact of various available compression configurations on the compression ratio. - We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement. Known imperfections: - The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first. New feature, no backporting involved. Closes scylladb/scylladb#23025 * github.com:scylladb/scylladb: docs: add user-facing documentation for SSTable compression with shared dicts docs/dev: add sstable-compression-dicts.md test: add test_sstable_compression_dictionaries_autotrain.py test: add test_sstable_compression_dictionaries_basic.py test/pylib/rest_client: add `keyspace_upgrade_sstables` helper main: run a sstable_dict_autotrainer api: add the estimate_compression_ratios API call dict_autotrainer: introduce sstable_dict_autotrainer db/system_keyspace: add query_dict_timestamp compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor main: clean up sstable compression dicts after table drops sstables/compress: discard hidden compression options after the decompressor is created compress: change compressor_ptr from shared_ptr to unique_ptr api: add the retrain_dict API call storage_service: add some dict-related routines main: in compression_dict_updated_callback, recognize and use SSTable compression dicts storage_service: add do_sample_sstables() messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback database: add sample_data_files() database: add take_sstable_set_snapshot() compress: teach `lz4_processor` about dictionaries compress: teach `zstd_processor` about dictionaries sstables: delegate compressor creation to the compressor factory sstables: plug an `sstable_compressor_factory` into `sstables_manager` sstables: introduce sstable_compressor_factory utils/hashers: add get_sha256() gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature compress: add hidden dictionary options compress: remove `compression_parameters::get_compressor()` sstables/compress: remove get_sstable_compressor() sstables/compress: move ownership of `compressor` to `sstable::compression` compress: remove compressor::option_names() compress: clean up the constructor of zstd_processor compress: squash zstd.cc into compress.cc sstables/compress: break the dependency of `compression_parameters` on `compressor` compress.hh: switch compressor::name() from an instance member to a virtual call bytes: adapt fmt_hex to std::span<const std::byte>	2025-04-01 12:47:34 +03:00
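The deduplication property mentioned in the entry above ("each unique live dictionary blob has only one instance in memory") can be illustrated with a small registry. This is a sketch under assumptions: keying by a SHA-256 of the blob and using weak references to track liveness are choices made here for illustration, and `Dictionary` and `DictRegistry` are hypothetical names, not the factory's actual types.

```python
import hashlib
import weakref


class Dictionary:
    """Wrapper around a dictionary blob (bytes cannot be weakly referenced directly)."""

    def __init__(self, blob: bytes) -> None:
        self.blob = blob


class DictRegistry:
    """Hands out one shared Dictionary object per unique blob content."""

    def __init__(self) -> None:
        # Entries disappear once no caller holds the Dictionary any more.
        self._live = weakref.WeakValueDictionary()

    def get(self, blob: bytes) -> Dictionary:
        key = hashlib.sha256(blob).digest()  # assumed content key
        existing = self._live.get(key)
        if existing is not None:
            return existing
        created = Dictionary(blob)
        self._live[key] = created
        return created

    def live_count(self) -> int:
        return len(self._live)


registry = DictRegistry()
first = registry.get(b"dictionary-bytes-A")
second = registry.get(b"dictionary-bytes-A")  # same content, same object
other = registry.get(b"dictionary-bytes-B")
```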
Piotr Smaron	370707b111	service: restore default timeout in `announce_with_raft` This restored timeout seems to have been accidentally removed in `7081215552 (r2005352424)`. Without it, `raft_server_with_timeouts::run_with_timeout` will get `std::nullopt` as a value of the `timeout` parameter and perform an operation without any timeout, whereas previously it would have waited for the default timeout specified in `raft_server_for_group::default_op_timeout`. Closes scylladb/scylladb#23380	2025-04-01 10:20:16 +03:00
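The behaviour change described in the entry above comes down to what happens when no timeout is supplied. A minimal sketch follows, with Python's `None` standing in for `std::nullopt`. All function names and the default value are hypothetical stand-ins, not the real Scylla signatures.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")

DEFAULT_OP_TIMEOUT_S = 60.0  # hypothetical default, for illustration only


def run_with_timeout(op: Callable[[], T], timeout: Optional[float]) -> Tuple[T, Optional[float]]:
    """Stand-in that runs op and reports which timeout would have applied.

    A timeout of None means the operation runs with no timeout at all.
    """
    return op(), timeout


def announce_without_default(op: Callable[[], T], timeout: Optional[float] = None):
    # The regressed behaviour: a missing timeout is passed through unchanged.
    return run_with_timeout(op, timeout)


def announce_with_default(op: Callable[[], T], timeout: Optional[float] = None):
    # The restored behaviour: fall back to the default when none is given.
    if timeout is None:
        timeout = DEFAULT_OP_TIMEOUT_S
    return run_with_timeout(op, timeout)


regressed = announce_without_default(lambda: "ok")
restored = announce_with_default(lambda: "ok")
explicit = announce_with_default(lambda: "ok", timeout=5.0)
```

An explicitly supplied timeout is left untouched in both variants; only the missing-timeout case differs.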
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides a compression ratio smaller than 95% of the current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
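The per-tick decision described in the entry above can be sketched as two predicates. The thresholds (more than 1 GiB of data, a new ratio below 95% of the current one) come from the commit message; `TableState`, the function names, and the retrain period used in the example are hypothetical, and the ratio is assumed to be compressed size divided by uncompressed size, so smaller is better.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DATA_BYTES = 1 << 30    # "enough data (>1GiB)"
IMPROVEMENT_FACTOR = 0.95   # new ratio must be below 95% of the current ratio


@dataclass
class TableState:
    data_bytes: int
    dict_age_s: Optional[float]  # None means the table has no dict yet


def needs_retrain(table: TableState, retrain_period_s: float, is_leader: bool) -> bool:
    """Decide whether a dict-aware table should be retrained on this tick."""
    if not is_leader:
        return False  # only the current Raft leader trains
    has_fresh_dict = table.dict_age_s is not None and table.dict_age_s <= retrain_period_s
    if has_fresh_dict:
        return False
    return table.data_bytes > MIN_DATA_BYTES


def should_publish(new_ratio: float, current_ratio: float) -> bool:
    """Publish only if the new dict is significantly better than the current one."""
    return new_ratio < IMPROVEMENT_FACTOR * current_ratio


ONE_DAY_S = 86400.0  # hypothetical retrain period
big_no_dict = TableState(data_bytes=2 << 30, dict_age_s=None)
big_fresh_dict = TableState(data_bytes=2 << 30, dict_age_s=100.0)
exactly_1gib = TableState(data_bytes=1 << 30, dict_age_s=None)
```

For example, a new ratio of 0.40 against a current 0.50 clears the 0.475 bar and would be published, while 0.48 would not.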
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00

5341 Commits