scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Aleksandra Martyniuk	2dcea5a27d	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055	2025-05-12 09:36:48 +03:00
Gleb Natapov	7403de241c	test: add reproducer for #22777 Add sleep before starting gossiper to increase a chance of getting old gossiper entry about yourself before updating local gossiper info with new IP address.	2025-05-06 11:21:17 +03:00
Gleb Natapov	ecd14753c0	storage_service: Do not remove gossiper entry on address change When gossiper indexed entries by ip an old entry had to be removed on an address change, but the index is id based, so even if ip was change the entry should stay. Gossiper simply updates an ip address there.	2025-05-04 17:59:07 +03:00
Gleb Natapov	a2178b7c31	storage_service: use id to check for local node IP may change and an old gossiper message with previous IP may be processed when it shouldn't. Fixes: #22777	2025-05-04 17:59:07 +03:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has a tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Tomasz Grabiec	001d3b2415	Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. Closes scylladb/scylladb#23751 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-16 00:19:24 +02:00
Pavel Emelyanov	b79137eaa4	storage_service: Use this->_features directly This dependency is already there, storage service doesn't need to go rounds via database reference to get to the features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23739	2025-04-15 21:11:12 +03:00
Laszlo Ersek	841ca652a0	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 15:15:25 +02:00
Laszlo Ersek	e1186f0ae6	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 13:44:45 +02:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Benny Halevy	a67ed59399	storage_service: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	eb17af6143	locator: service: add transition to rebuild_repair stage for rebuild_v2 Modify write_both_read_old and streaming stages in rebuild_v2 transition kind: write_both_read_old moves to rebuild_repair stage and streaming stage streams data only from one replica.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streraming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transtions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Gleb Natapov	6f53611337	gossiper: move force_remove_endpoint to work on host id Since the gossiper works on host ids now it is incorrect to leave this function to work on ip. It makes it impossible to delete outdated entry since the "gossiper.get_host_id(endpoint) != id" check will always be false for such entries (get_host_id() always returns most up -to-date mapping.	2025-04-06 18:39:24 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implication outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. On wire protocol is not affected by the change as maps that are sent by the gossiper protocol remain ip based. If old node sends two different entries for the same host id the one with newer generation is applied. If new node has two ids that are mapped to the same ip the newer one is added to the outgoing map. Interoperability was verified manually by running mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides ratio smaller than 95% of current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	4115a6fece	storage_service: add some dict-related routines storage_service will be the interface between the API layer (or the automatic training loop) and the dict machinery. This commit implements the relevant interface for that. It adds methods that: 1. Take SSTable samples from the cluster, using the new RPC verbs. 2. Train a dict on the sample. (The trainer will be plugged in from `main`). 3. Publishes the trained dictionary. (By adding mutations to Raft group 0). Perhaps this should be moved to a separate "service". But it's not like `storage_service` has a clear purpose anyway.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	380f409c46	storage_service: add do_sample_sstables() Adds a helper which uses ESTIMATE_SSTABLE_VOLUME and SAMPLE_SSTABLES RPC calls to gather a combined sample of SSTable Data files for the given table from the entire cluster.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94c33b6760	messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs Add two verbs needed to implement dictionary training for SSTable compression. SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files with a given cardinality and using a given chunk size, for the given table. ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data files the given table.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Gleb Natapov	afdfde8300	gossiper: rename get_nodes_with_host_id to get_node_ip Also change it to return std::optional instead of std::set since now there can be only on ip mapped to an id.	2025-03-31 16:50:50 +03:00
Gleb Natapov	28fb84117d	treewide: drop id parameter from gossiper::for_each_endpoint_state We have it in endpoint_state anyway, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Gleb Natapov	0dd86b4f1d	gossiper: move get_supported_features and its users to host id	2025-03-31 15:42:07 +03:00
Gleb Natapov	f97bb6922d	storage_service: make candidates_for_removal host id based	2025-03-31 15:42:07 +03:00
Gleb Natapov	82491cec19	gossiper: use peers table to detect address change This requires serializing entire handle_state_normal with a lock since it both reads and updates peers table now (it only updated it before the change). This is not a big deal since most of it is already serialized with token metadata lock. We cannot use it to serialize peers writes as well since the code that removes an endpoint from peers table also removes it from gossiper which causes on_remove notification to be called and it may take the metadata lock as well causing deadlock.	2025-03-31 15:41:44 +03:00
Gleb Natapov	1c2a9257e9	storage_service: use std::views::keys instead of std::views::transform that returns a key	2025-03-31 15:25:39 +03:00
Pavel Emelyanov	1da889f239	Merge 'Allow abort during join_cluster' from Benny Halevy Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 Closes scylladb/scylladb#23306 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-03-26 15:48:58 +03:00
Michael Litvak	49b8cf2d1d	storage_service: fix tablet split of materialized views This fixes an issue where materialized view tablets are not split because they are not registered as split candidates by the storage service. The code in storage_service::replicate_to_all_cores was changed in `4bfa3060d0` to handle normal tables and view tables separately, but with that change register_tablet_split_candidate is applied only to normal tables and not every table like before. We fix it by registering view tables as well. We add a test to verify that split of MV tables works. Closes scylladb/scylladb#23335	2025-03-24 08:23:58 +01:00
Botond Dénes	8f0d0daf53	Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk Do not hold erm during repair of a tablet that is started with tablet repair scheduler. This way two different tablets can be repaired and migrated concurrently. The same tablet won't be migrated while being repaired as it is provided by topology coordinator. Use topology_guard to maintain safety. Fixes: https://github.com/scylladb/scylladb/issues/22408. Needs backport to 2025.1 that introduces the tablet repair scheduler. Closes scylladb/scylladb#22842 * github.com:scylladb/scylladb: test: add test to check concurrent tablets migration and repair repair: do not hold erm for repair scheduled by scheduler repair: get total rf based on current erm repair: make shard_repair_task_impl::erm private repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary repair: pass session_id to repair_writer_impl::create_writer repair: keep materialized topology guard in shard_repair_task_impl repair: pass session_id to repair_meta	2025-03-19 08:55:24 +02:00
Kefu Chai	aca00118fb	service: fix misspellings these misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23334	2025-03-18 22:21:45 +02:00
Benny Halevy	0fc196991a	storage_service: add start_sys_dist_ks Currently, there's a call to `supervisor::notify("starting system distributed keyspace")` which is misleading as it is identical to a similar message in main() when starting the sharded service. Change that to a storage_service log messages and be more specific that the sys_dist_ks shards are started. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:05:23 +02:00
Patryk Jędrzejczak	4fd0e93154	test: add tests for the Raft-based recovery procedure	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	fd51d7e448	treewide: allow recreating group 0 in the Raft-based recovery procedure This patch adds support for recreating group 0 after losing majority. This is the only part of the new Raft-based recovery procedure that touches Scylla core. The following steps are necessary to recreate group 0: 1. Determine the new group 0 members. These are alive nodes that are normal or rebuilding. 2. Choose the recovery leader - the node which will become the new group 0 leader. This must be one of the nodes with the latest persistent group 0 state. 3. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 4. Set the new scylla.yaml parameter - `recovery_leader` - to Host ID of the recovery leader on each live node. 5. Rolling restart all live nodes, but the recovery leader must be restarted first. In the implementation, restarts in step 5 are very similar to normal restarts with the Raft-based topology enabled. The only differences are: 1. Steps 3-4 make the restarting node discover the new group 0 in `join_cluster`. 2. The group 0 server is started in `join_group0`, not `setup_group0_if_exists`. 3. The restarting node joins the new group 0 in `join_topology` using `legacy_handshaker`. There is no reason to contact the topology coordinator since the node has already joined the topology. Unfortunately, this patch creates another execution path for the starting logic. `join_cluster` becomes even messier. However, there is nothing we can do about it. Joining group 0 without joining topology is something completely new. Having a few small changes without touching other execution paths is the best we can do. We will start removing the old stuff soon, after making the Raft-based topology mandatory, and the situation will improve.	2025-03-14 13:52:57 +01:00
Aleksandra Martyniuk	928f92c780	repair: pass session_id to repair_meta Pass session_id of tablet repair down the stack from the repair request to repair_meta. The session_id will be utiziled in the following patches.	2025-03-14 10:20:12 +01:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This is series starts conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. On of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new node can co-exist in the same cluster, " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused template ...	2025-03-13 13:36:31 +02:00
Gleb Natapov	cca228265e	gossiper: move _expire_time_endpoint_map to host_id Index _expire_time_endpoint_map map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	24d30073f9	messaging_service: pass host id to remove_rpc_client in down notification Do not iterate over all client indexed by hos id to search for those with given IP. Look up by host id directly since now we know it in down notification. In cases host id is not known look it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	4ca627b533	treewide: pass host id to endpoint_lifecycle_subscriber	2025-03-11 12:09:22 +02:00
Gleb Natapov	48a1030c91	treewide: use host id directly in endpoint state change subscribers Now that we have host ids in endpoint state change subscribers some of them can be simplified by using the id directly instead of locking it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	499eb4d17f	treewide: pass host id to endpoint state change subscribers	2025-03-11 12:09:22 +02:00
Gleb Natapov	c17a8b4a76	storage_service: drop unused code in handle_state_removed	2025-03-11 12:09:21 +02:00
Gleb Natapov	696aee3adc	treewide: drop endpoint state change subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:21 +02:00
Gleb Natapov	c3035caeb5	topology_coordinator: notify about IP change from sync_raft_topology_nodes as well Currently sync_raft_topology_nodes() only send join notification if a node is new in the topology, but sometimes a node changes IP and the join notification should be send for the new IP as well. Usually it is done from ip_address_updater, but topology reload can run first and then the notification will be missed. The solution is to send notification during topology reload as well.	2025-03-11 12:09:21 +02:00
Gleb Natapov	0e3dcb7954	treewide: move everyone to use host id based gossiper::is_alive and drop ip based one	2025-03-11 12:09:21 +02:00
Gleb Natapov	e47f251178	gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id Index live and dead endpoints by host id. It also allows to simplify some code that does a translation.	2025-03-11 12:09:21 +02:00

1 2 3 4 5 ...

2322 Commits