These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.
We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
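As a rough illustration of the trade-off (this is a toy sketch, not the real `database::apply`; `commitlog`, `memtable` and `mutation` here are stand-ins):
```
#include <vector>

struct mutation {};          // stand-in for a real mutation

struct commitlog {
    void append(const mutation&) { /* buffer the entry */ }
    void sync() { /* one fsync for everything appended so far */ }
};

struct memtable {
    void apply(const mutation&) { /* publish to the in-memory table;
                                      may block under memory pressure */ }
};

// One sync() amortized over the whole batch instead of one per mutation.
// Caveat from the text: if publishing blocks partway through (e.g. in
// run_when_memory_available), earlier mutations are already visible while
// later ones are not yet -- the batch is not atomic.
void apply_batch(commitlog& log, memtable& mt, const std::vector<mutation>& batch) {
    for (const auto& m : batch) {
        log.append(m);
    }
    log.sync();
    for (const auto& m : batch) {
        mt.apply(m);
    }
}

int main() {
    commitlog log;
    memtable mt;
    apply_batch(log, mt, {mutation{}, mutation{}, mutation{}});
}
```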
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both paging_state and position_in_partition, including
both files will produce a duplicate-definition error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.
Closes scylladb/scylladb#28228
Currently direct_fd_ping runs without a timeout, but the verb is not
waited for forever: the wait is canceled after a timeout, it is just
that this timeout is not passed to the rpc. This can create a situation
where the rpc callback runs on a destination but is no longer waited on.
Change the code to pass the timeout to the rpc as well and return early
from the rpc handler if the timeout has been reached by the time the
callback is called. This is backwards compatible since the timeout is
passed as optional.
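A minimal sketch of the approach, with hypothetical names; the real verb plumbs the deadline through the rpc layer:
```
#include <chrono>
#include <optional>
#include <stdexcept>

using clock_type = std::chrono::steady_clock;
using time_point = clock_type::time_point;

// Handler side: the timeout is optional for backwards compatibility with
// senders that do not pass it.
void handle_direct_fd_ping(std::optional<time_point> timeout) {
    if (timeout && clock_type::now() >= *timeout) {
        // The caller has stopped waiting already; don't do the work.
        throw std::runtime_error("ping timed out before the handler ran");
    }
    // ... perform the ping ...
}

int main() {
    auto deadline = clock_type::now() + std::chrono::seconds(1);
    handle_direct_fd_ping(deadline);     // new sender: passes its deadline
    handle_direct_fd_ping(std::nullopt); // old sender: no timeout on the wire
}
```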
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.
With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.
In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports them in groups, significantly reducing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.
Fixes https://github.com/scylladb/scylladb/issues/26311
This patch needs to be backported to 2025.4.
Closes scylladb/scylladb#26897
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: update the docs after recent changes
db/view/view_building: send coordinator's term in the RPC
db/view/view_building_state: replace task's state with `aborted` flag
db/view/view_building_coordinator: batch finished tasks reporting
db/view/view_building_worker: change internal implementation
db/view/view_building_coordinator: change `work_on_tasks` RPC return type
To avoid the case where an old coordinator (which hasn't been stopped yet)
dictates what should be done, add the raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check that the term matches the current term from the raft
server, and deny the request when the term is stale.
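A minimal sketch of the term check, assuming a hypothetical handler shape:
```
#include <cstdint>
#include <stdexcept>

using term_t = uint64_t;

struct raft_server {                  // stand-in for the real raft server
    term_t current_term = 7;
    term_t get_current_term() const { return current_term; }
};

// Sketch: the worker denies work_on_view_building_tasks requests carrying
// a term that does not match its raft server's current term, so an old,
// not-yet-stopped coordinator cannot dictate work.
void work_on_view_building_tasks(const raft_server& raft, term_t coordinator_term) {
    if (coordinator_term != raft.get_current_term()) {
        throw std::runtime_error("view building request from stale coordinator");
    }
    // ... execute the requested tasks ...
}

int main() {
    raft_server raft;
    work_on_view_building_tasks(raft, 7);      // current coordinator: accepted
    try {
        work_on_view_building_tasks(raft, 6);  // stale coordinator: denied
    } catch (const std::runtime_error&) {}
}
```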
This reverts commit 11f045bb7c.
The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them, and also allows for special behavior as opposed to repairing
a single specific tablet.
It is not used anymore because we decided not to repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release to support repairing colocated tables individually.
We can remove the rpc in 2025.4 because it was introduced in the same
version.
Currently if a banned node tries to connect to a cluster, it fails to
create connections but has no idea why, so from inside the node it
looks like it has communication problems. This patch adds a new rpc,
NOTIFY_BANNED, which is sent back to the node when its connection is
dropped. On receiving the rpc, the node isolates itself and prints an
informative message about why it did so.
Closes scylladb/scylladb#26943
This commit doesn't change the logic behind the view building worker, but
it changes how the worker executes view building tasks.
Previously, the worker had state only on shard0 and reacted to
changes in the group0 state. When it noticed that some tasks had moved to
the `STARTED` state, the worker created a batch for them in the shard0
state.
The RPC call was used only to start the batch and to get its result.
Now the main logic of batch management has been moved to the RPC call
handler.
The worker now has local state on each shard, and the state
contains:
- a unique ptr to the batch
- the set of completed tasks
- information about which views the base table was flushed for
So each batch now lives exclusively on the shard where it has its work
to do. This eliminates the need for synchronization between
shard0 and the working shard, which was a pain point in the previous
implementation (see the sketch below).
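A sketch of what such per-shard state could look like (type names are stand-ins; the real code uses UUIDs and internal types):
```
#include <memory>
#include <string>
#include <unordered_set>

// Stand-ins for the real types; names are illustrative only.
struct batch {};
using task_id = std::string;   // the real code uses UUIDs
using view_id = std::string;

// Per-shard worker state, mirroring the three items listed above.
struct shard_worker_state {
    std::unique_ptr<batch> current_batch;         // unique ptr to the batch
    std::unordered_set<task_id> completed_tasks;  // set of completed tasks
    std::unordered_set<view_id> flushed_views;    // views whose base table was flushed
};

int main() {
    shard_worker_state st;
    st.current_batch = std::make_unique<batch>();
    st.completed_tasks.insert("task-1");
}
```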
The worker still reacts to changes in the group0 view building state, but
this is now only used to observe whether any view building task was
aborted by setting the `ABORTED` state.
To prepare for further changes that drop the view building task state,
the worker ignores the `IDLE` and `STARTED` states completely.
During the initial implementation of the view building coordinator,
we decided that if a view building task fails locally on the worker
(example reason: the view update's target replica is not available),
the worker retries this work instead of reporting a failure to the
coordinator.
However, we left the return type of the RPC as-is; it indicated whether a
task finished successfully or was aborted.
But the worker doesn't need to report that a task was aborted, because
it is the coordinator that decides to abort a task.
So this commit changes the return type to a list of UUIDs of completed
tasks.
Previously, the length of the returned vector had to match the length
of the vector sent in the request.
Now we can drop this restriction and the RPC handler returns a list of UUIDs
of completed tasks (a subset of the vector sent in the request).
This change is required to drop the `STARTED` state in the next commits
(see the sketch below).
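A sketch of the new handler contract, with hypothetical names:
```
#include <string>
#include <vector>

using task_uuid = std::string;  // stand-in for a real UUID type

// Hypothetical local execution; a failed task is retried later, not reported.
bool try_execute(const task_uuid&) { return true; }

// The handler returns only the ids it actually completed -- a subset of
// the request -- instead of one result per requested task.
std::vector<task_uuid> work_on_tasks(const std::vector<task_uuid>& requested) {
    std::vector<task_uuid> completed;
    for (const auto& id : requested) {
        if (try_execute(id)) {
            completed.push_back(id);
        }
        // otherwise: keep retrying locally; only the coordinator may
        // decide to abort the task
    }
    return completed;
}

int main() {
    auto done = work_on_tasks({"a", "b"});
    return done.size() == 2 ? 0 : 1;
}
```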
Since Scylla 2025.4 hasn't been released yet and we're going to merge this
patch before the release, no RPC versioning or cluster feature is needed.
This patch changes the tablet size map in load_stats. Previously, this
data structure was:
std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;
and is changed into:
std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;
This allows for improved performance of tablet size reconciliation.
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes #26503
Closes scylladb/scylladb#26504
This commit extends the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
The effective disk capacity of a node is computed as the sum of the sizes of
all tablet replicas on the node plus the available disk space.
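A minimal sketch of that computation:
```
#include <cstdint>
#include <numeric>
#include <vector>

// Effective disk capacity = sum of all tablet replica sizes on the node
// plus the currently available disk space, per the definition above.
uint64_t effective_disk_capacity(const std::vector<uint64_t>& replica_sizes,
                                 uint64_t available_disk_space) {
    return std::accumulate(replica_sizes.begin(), replica_sizes.end(),
                           available_disk_space);
}

int main() {
    // e.g. two replicas of 100 and 50 bytes with 850 bytes free => 1000
    return effective_disk_capacity({100, 50}, 850) == 1000 ? 0 : 1;
}
```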
This is the first change in the size-based load balancing series.
Closes scylladb/scylladb#26035
This PR refactors the can_vote function in the Raft algorithm for improved clarity and maintainability by providing safer strong boolean types to the raft code.
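A self-contained sketch of the strong-boolean idea (Seastar provides bool_class for this; the flag names and can_vote logic below are purely illustrative, not the real Raft code):
```
#include <cstdio>

// Each flag gets its own type, so arguments cannot be swapped silently
// at a call site the way plain bools can.
template <typename Tag>
class strong_bool {
    bool _v;
public:
    explicit constexpr strong_bool(bool v) : _v(v) {}
    explicit constexpr operator bool() const { return _v; }
};

using is_leader_alive = strong_bool<struct is_leader_alive_tag>;
using allow_prevote   = strong_bool<struct allow_prevote_tag>;

// Illustrative only -- not the real can_vote logic.
bool can_vote(is_leader_alive alive, allow_prevote prevote) {
    return !bool(alive) || bool(prevote);
}

int main() {
    // can_vote(true, false) would be easy to get backwards with plain
    // bools; here the intent is explicit and checked by the type system.
    std::printf("%d\n", can_vote(is_leader_alive(false), allow_prevote(true)));
}
```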
Fixes: #21937
Backport: No backport required
Closes scylladb/scylladb#25787
As requested in #22104, moved the files and fixed other includes and build system.
Moved files:
- combine.hh
- collection_mutation.hh
- collection_mutation.cc
- converting_mutation_partition_applier.hh
- converting_mutation_partition_applier.cc
- counters.hh
- counters.cc
- timestamp.hh
Fixes: #22104
This is a cleanup, no need to backport
Closes scylladb/scylladb#25085
Currently, in repair_tablet we retrieve the session_id from the tablet
map (and throw if it isn't specified). In case of topology
coordinator failover, we may end up in a situation where a node
runs an outdated repair, treating the session of a different operation
as the repair's session:
- topology coordinator starts repair transition (A);
- topology coordinator sends tablet repair rpc to node1;
- topology coordinator is separated from the cluster;
- new topology coordinator is elected;
- new topology coordinator sees waiting repair request (A_2)
and executes it;
- new repair of the same tablet is requested (B);
- new topology coordinator starts repair transition (B);
- new topology coordinator sends tablet repair rpc to node2;
- node2 starts repair (B) as repair master;
- node1 starts repair (A), checks the current session (B), proceeds
with repair (B) as repair master.
Send the current session_id in the repair_tablet rpc. If this session_id
and the session id obtained from the tablet map don't match, an exception
is thrown.
Fixes: https://github.com/scylladb/scylladb/issues/23318.
No backport; changes in rpc signatures
Closes scylladb/scylladb#25369
* github.com:scylladb/scylladb:
test: check that repair with outdated session_id fails
service: pass current session_id to repair rpc
This PR consists of three parts:
* Small refactoring of the fencing APIs in storage_proxy (renames + comments + some functions were extracted)
* Implement the fencing for LWT verbs itself. This includes checking the fencing token before and after local replica data accesses.
* Two new `test.py` tests in `test_fencing.py`, which check the fencing in some real-world scenarios.
Backport: no need -- fencing for LWT requests is needed primarily for LWT over tablets, which is not released yet.
Fixes scylladb/scylladb#22332
Closes scylladb/scylladb#25550
* https://github.com/scylladb/scylladb:
test_tablets_lwt: eliminate redundant disable_tablet_balancing
test_fencing: add test_lwt_fencing_upgrade
pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py
test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb
test_fencing: test_fence_lwt_during_bootstap
pylib/rest_client.py: encode injection name
storage_proxy_stats: add fenced_out_requests metric
storage_proxy: add fencing to Paxos verbs
storage_proxy::apply_fence: add overload that throws on failure
storage_proxy: extract apply_fence_result
sp::apply_fence: rename to apply_fence_on_ready
sp::apply_fence: rename to check_fence
sp::apply_fence: make non-generic
As requested in #22120, moved the files and fixed other includes and build system.
Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh
Fixes: #22120
This is a cleanup, no need to backport
Closes scylladb/scylladb#25105
This commit adds fencing support to all Paxos verbs:
* Pass an optional (for backward compatibility) fencing_token as a
parameter to the prepare, accept, learn, and prune verbs.
* Call apply_fence twice, before and after accessing local data (see the
sketch after this list). This ensures that if the coordinator is fenced
out mid-request, the replica does not return success, which would
otherwise incorrectly contribute to achieving the target CL. Without
this, a user might observe successful writes that become unreadable
after the topology operation completes.
* For prune, call apply_fence only once because it does not return a
response to the LWT coordinator.
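A hedged sketch of the before/after check (the token layout and comparison here are assumptions, not the real fencing implementation):
```
#include <cstdint>
#include <optional>
#include <stdexcept>

// Stand-ins: a fencing token carries the topology version the coordinator
// saw; the replica rejects requests from a fenced-out (older) version.
struct fencing_token { uint64_t version; };

uint64_t local_topology_version() { return 42; }  // hypothetical accessor

void check_fence(const std::optional<fencing_token>& t) {
    // optional for backward compatibility: old coordinators send nothing
    if (t && t->version < local_topology_version()) {
        throw std::runtime_error("coordinator fenced out");
    }
}

void handle_paxos_accept(std::optional<fencing_token> token) {
    check_fence(token);   // before touching local data
    // ... access local replica data ...
    check_fence(token);   // after: if fenced out mid-request, do not
                          // report success toward the target CL
}

int main() {
    handle_paxos_accept(fencing_token{42});
    handle_paxos_accept(std::nullopt);  // old sender
}
```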
Fixes scylladb/scylladb#22332
The version keyword is missing for the optional mark_as_repaired
parameter. This causes the new node to expect more data to come:
INFO 2025-09-01 19:23:05,332 [shard 0:strm] rpc - client
127.0.7.6:50116 msg_id 8: caught exception while processing a message:
std::out_of_range (deserialization buffer underflow)
When the sender is an old node in a mixed cluster, the data will never
come. To fix, add the missing version keyword.
Our idl-compiler.py should have caught the typo since the keyword
was missing in the [[]] tag.
Fixes #25666
Closes scylladb/scylladb#25782
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was always on and could not be turned off.
This change allows users to select from three distinct modes (see the
sketch further below):
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
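As a compact summary of the three modes' semantics, a hedged sketch (field and function names are stand-ins, not the real repair code):
```
#include <cstdint>

enum class incremental_mode { regular, full, disabled };

// Stand-in for the relevant sstable metadata.
struct sstable_meta { uint64_t repaired_at; };

bool should_process(const sstable_meta& sst, incremental_mode mode,
                    uint64_t sstables_repaired_at) {
    if (mode == incremental_mode::regular) {
        // skip sstables that are already repaired
        return sst.repaired_at == 0 || sst.repaired_at > sstables_repaired_at;
    }
    return true;  // full and disabled both process every sstable
}

bool updates_repair_state(incremental_mode mode) {
    // disabled behaves like a classic repair: no repaired_at /
    // sstables_repaired_at bookkeeping is done
    return mode != incremental_mode::disabled;
}

int main() {
    sstable_meta sst{5};
    // repaired_at=5 <= sstables_repaired_at=10, so regular mode skips it
    return should_process(sst, incremental_mode::regular, 10) ? 1 : 0;
}
```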
Fixes #25605
Closes scylladb/scylladb#25693
When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
  - includes: inserts, updates, deletes, counter updates, hints, read-repair and lwt writes
  - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent new ones from starting
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drops below the threshold.
Apart from that, the series adds tests that require mounted volumes to simulate running out of space.
The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`.
When not provided, the tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out, and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closes scylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
Currently, in repair_tablet we retrieve the session_id from the tablet
map (and throw if it isn't specified). In case of topology
coordinator failover, we may end up in a situation where a node
runs an outdated repair, treating the session of a different operation
as the repair's session:
- topology coordinator starts repair transition (A);
- topology coordinator sends tablet repair rpc to node1;
- topology coordinator is separated from the cluster;
- new topology coordinator is elected;
- new topology coordinator sees waiting repair request (A_2)
and executes it;
- new repair of the same tablet is requested (B);
- new topology coordinator starts repair transition (B);
- new topology coordinator sends tablet repair rpc to node2;
- node2 starts repair (B) as repair master;
- node1 starts repair (A), checks the current session (B), proceeds
with repair (B) as repair master.
Send the current session_id in the repair_tablet rpc. If this session_id
and the session id obtained from the tablet map don't match, an exception
is thrown.
This commit extends the TABLE_LOAD_STATS RPC with information about whether
a node operates in the critical disk utilization mode.
This information will be needed to distinguish between the causes of
an interrupted tablet migration/repair.
The new exception `critical_disk_utilization_exception` is thrown
when user table mutation writes are being blocked due to e.g.
reaching a critical disk utilization level.
This new exception is then correctly handled on the coordinator
side by transforming it into a `mutation_write_failure_exception` with
a meaningful error message: "Write rejected due to critical disk
utilization".
The RPC will be used by the view building coordinator to attach to and wait
for tasks performed by the view building worker (introduced in a later
commit).
The RPC gets a vector of task ids and returns a vector of
`view_task_result`s; the i-th task result refers to the i-th task id.
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but less efficient because it requires reading the entire
dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows entire SSTables to be omitted. This patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
sstable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an sstable as repaired. It uses a virtual clock as the
repair timestamp, i.e., a monotonically increasing number for the
repaired_at field of an SSTable and the sstables_repaired_at column in the
system.tablets table. Notice that when an sstable is not repaired, the
repaired_at field is set to the default value 0. The in-memory
being_repaired field of an SSTable is used to explicitly mark
that an SSTable has been selected. The following variables are used for
incremental repair (see the sketch below):
- The on-disk repaired_at field of an SSTable: a 64-bit number that
  increases sequentially.
- The sstables_repaired_at column added to the system.tablets table:
  repaired_at <= sstables_repaired_at means the sstable is repaired.
- The in-memory being_repaired field added to an SSTable: a repair UUID
  telling which sstables have participated in the repair.
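A sketch of that selection rule:
```
#include <cstdint>

// An sstable counts as repaired when its on-disk repaired_at is non-zero
// (0 = never repaired) and does not exceed the tablet's
// sstables_repaired_at from the system.tablets table.
bool is_repaired(uint64_t sstable_repaired_at, uint64_t sstables_repaired_at) {
    return sstable_repaired_at != 0 &&
           sstable_repaired_at <= sstables_repaired_at;
}

int main() {
    // incremental repair selects only the unrepaired sstables
    return (is_repaired(0, 7) == false && is_repaired(5, 7) == true) ? 0 : 1;
}
```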
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes #22472
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixes scylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closes scylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object, which is sent as an rpc stream message.
This could cause reactor stalls when rpc stream compression is
turned on, because the whole message is compressed in one go, without
being split into smaller pieces.
This patch solves the problem at a higher level by reducing the
size of the messages sent to the rpc stream.
Tests are added to make sure the message split works.
Fixes #24808
With the change in "repair: Avoid too many fragments in a single
repair_row_on_wire", the
std::list<frozen_mutation_fragment> _mfs;
in partition_key_and_mutation_fragments will no longer contain a large
number of fragments. Switch to using chunked_vector.
In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.
This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.
However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.
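A sketch of the new rule (types are stand-ins):
```
#include <optional>
#include <string>

using host_id = std::string;  // stand-in for the real host id type

// Sketch of the rule above: a group 0 mismatch in handle_syn_msg is
// ignored iff exactly one of the two gossiping nodes has recovery_leader
// set (the syn side's value now travels in gossip_digest_syn).
bool ignore_group0_mismatch(const std::optional<host_id>& local_recovery_leader,
                            const std::optional<host_id>& syn_recovery_leader) {
    return local_recovery_leader.has_value() != syn_recovery_leader.has_value();
}

int main() {
    return (ignore_group0_mismatch(host_id{"leader"}, std::nullopt) &&
            !ignore_group0_mismatch(std::nullopt, std::nullopt)) ? 0 : 1;
}
```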
Add a new RPC, repair_colocated_tablets, which is similar to the RPC
tablet_repair but, instead of repairing a single tablet, takes a set
of co-located tablets, repairs them and returns a shared repair_time
result.
This is useful because the way co-located tablets are represented
doesn't allow repairing tablets independently, only as a group
operation, and the repair_time stored in the tablet map is
shared with the entire co-location group.
But when repairing a group of co-located tablets we may require
different behavior, especially considering that co-located tablets are
derived tablets of a special type. For example, we may want to skip
running repair on CDC tablets when repairing the base table.
The new RPC and the storage service function repair_colocated_tablets
allow the flexibility to implement different strategies when repairing
co-located groups.
Currently the implementation simply repairs each tablet and returns
the minimum repair_time as the shared repair time.
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and possibly even subsequent partitions.
Introduce a new system table, `system.corrupt_data`, and infrastructure similar to `large_data_handler`: a `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing, and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, so it has to be backported to all supported versions.
Closes scylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is, a <host, shard> pair) processes two tablets at a time.
The new algorithm can be summarized as follows:
1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space.
2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each partition range (in such cases, the partition range is assigned to the appropriate tablet_replica).
In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.
In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.
In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.
Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits.
This patch series:
- Refactors the mapreduce service a bit, to facilitate having two algorithm versions (one for vnodes and one for tablets).
- Implements tablet-aware dispatching algorithm.
- Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
- Adds test_long_query_timeout_erm to verify the new functionality.
Fixes: scylladb#21831
No backport, as it is a new feature rather than a bugfix.
Closes scylladb/scylladb#24383
* github.com:scylladb/scylladb:
mapreduce: add missing comma and space in mapreduce_request operator<<
mapreduce: add shard_id_hint to mapreduce request
test: add test_long_query_timeout_erm
mapreduce: add tablet-aware dispatching algorithm
storage_proxy: make storage_proxy::is_alive public
mapreduce: remove _shared_token_metadata from mapreduce_service
mapreduce: move dispatching logic to dispatch_to_vnodes
mapreduce: remove underscores from variable names
mapreduce: move req_with_modified_pr handling to a new function
mapreduce: change next_vnode lambda to get_next_partition_range function
If a partition range is not present locally,
`partition_ranges_owned_by_this_shard` assigns it to shard 0, which can
overload shard 0. To address this, this commit adds a `shard_id_hint`
to the mapreduce request. When `shard_id_hint` is set, the entire
partition range in the request is handled by the specified shard.
The `shard_id_hint` is set by the new tablet-aware mapreduce algorithm,
introduced in `dispatch_to_tablets`. This algorithm balances the
workload across shards, so the changes in this commit ensure that
load balancing is preserved, even during events such as tablet splits.
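A sketch of the resulting routing rule (names hypothetical):
```
#include <optional>

// Stand-in for a partition range and for the existing shard lookup that
// returns nullopt when the range is not present locally.
struct partition_range {};

std::optional<unsigned> locally_owned_shard(const partition_range&) {
    return std::nullopt;  // hypothetical: range absent on this node
}

// Honor shard_id_hint when set; otherwise fall back to the old behavior,
// where locally-absent ranges all land on shard 0.
unsigned owning_shard(const partition_range& pr,
                      std::optional<unsigned> shard_id_hint) {
    if (shard_id_hint) {
        return *shard_id_hint;
    }
    return locally_owned_shard(pr).value_or(0);
}

int main() {
    partition_range pr;
    return (owning_shard(pr, 3u) == 3 &&
            owning_shard(pr, std::nullopt) == 0) ? 0 : 1;
}
```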
Fixes: scylladb#21831
Mirrors frozen_mutation_fragment and shares most of the underlying
serialization code; the only exception is replacing range_tombstone with
range_tombstone_change in the mutation fragment variant.
A future user of position_in_partition.idl doesn't need full_position,
and shouldn't have to include full_position.hh just to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user, in a
storage_proxy VERB.
We'd like to change the data layout of `interval` to save space.
As a result, start() and end() which return references to data
members must return objects (not references). Since we'd like
to maintain zero-copy for these functions, we change them to
return objects containing references (rather than references
to objects), avoiding copying of potentially expensive objects.
We repurpose the interval_bound class to hold references (by
instantiating it with `const T&` instead of `T`) and provide
converting constructors. To make transform_bounds() retain
zero-copy, we add start() and end() that take *this by
rvalue reference.
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we copy the object referred to and assign it to x. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.
To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that would break to the new start_copy() and end_copy() methods.
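A self-contained illustration of the pitfall that motivates the rename:
```
// With a proxy return type, `auto` deduces the proxy, which keeps
// referring into the interval it came from.
struct bound_ref {            // proxy holding a reference, like the
    const int& value;         // planned interval_bound over const T&
};

struct interval {
    int lo, hi;
    bound_ref start() const { return {lo}; }  // planned proxy-returning start()
    int start_copy() const { return lo; }     // explicit, always-safe copy
};

int main() {
    interval i{1, 2};
    auto x = i.start();       // x is a bound_ref still referring into i
    int y = i.start_copy();   // y is an independent copy
    i.lo = 100;               // x.value now observes 100; y still holds 1
    return (x.value == 100 && y == 1) ? 0 : 1;
}
```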
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.
Convert it to utils::chunked_vector, which prevents allocation
stalls.
It is not used in any hot path, as it usually describes
vnodes or similar things.
Fixes #3335.
Currently send_gossip_echo has a 22-second timeout
during which _abort_source is ignored.
Mark the verb as cancellable so it can be canceled
on shutdown / abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
The series makes the endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implications outside of the
gossiper as well: gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.
The on-wire protocol is not affected by the change, as maps that are sent
by the gossiper protocol remain ip based. If an old node sends two
different entries for the same host id, the one with the newer generation
is applied. If a new node has two ids that are mapped to the same ip, the
newer one is added to the outgoing map.
Interoperability was verified manually by running a mixed cluster.
The series concludes the conversion of the system to be host id based.
"
* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
gossiper: make examine_gossiper private
gossiper: rename get_nodes_with_host_id to get_node_ip
treewide: drop id parameter from gossiper::for_each_endpoint_state
treewide: move gossiper to index nodes by host id
gossiper: drop ip from replicate function parameters
gossiper: drop ip from apply_new_states parameters
gossiper: drop address from handle_major_state_change parameter list
gossiper: pass rpc::client_info to gossiper_shutdown verb handler
gossiper: add try_get_host_id function
gossiper: add ip to endpoint_state
serialization: fix std::map de-serializer to not invoke value's default constructor
gossiper: drop template from wait_alive_helper function
gossiper: move get_supported_features and its users to host id
storage_service: make candidates_for_removal host id based
gossiper: use peers table to detect address change
storage_service: use std::views::keys instead of std::views::transform that returns a key
gossiper: move _pending_mark_alive_endpoints to host id
gossiper: do not allow to assassinate endpoint in raft topology mode
gossiper: fix indentation after previous patch
gossiper: do not allow to assassinate non existing endpoint
Add two verbs needed to implement dictionary training for SSTable
compression.
SAMPLE_SSTABLES returns a list of randomly-selected chunks of the Data files
of the given table, with a given cardinality and using a given chunk size.
ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data
files of the given table.
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series, one of which is dropping code
related to the old shadow round. We replaced the shadow round with the
explicit GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.
I tested manually that old and new nodes can co-exist in the same
cluster.
"
* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
gossiper: drop unneeded code
gossiper: move _expire_time_endpoint_map to host_id
gossiper: move _just_removed_endpoints to host id
gossiper: drop unused get_msg_addr function
messaging_service: change connection dropping notification to pass host id only
messaging_service: pass host id to remove_rpc_client in down notification
treewide: pass host id to endpoint_lifecycle_subscriber
treewide: drop endpoint life cycle subscribers that do nothing
load_meter: move to host id
treewide: use host id directly in endpoint state change subscribers
treewide: pass host id to endpoint state change subscribers
gossiper: drop deprecated unsafe_assassinate_endpoint operation
storage_service: drop unused code in handle_state_removed
treewide: drop endpoint state change subscribers that do nothing
gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
gossiper: start using host ids to send messages earlier
messaging_service: add temporary address map entry on incoming connection
topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
storage_proxy: drop unused template
...
Send digest ack and ack2 by host ids as well now, since the id->ip
mapping is available after receiving digest syn. This allows converting
more code here to host id.
The patch adds a new marker for a verb, [[ip]], which means that an ip
version of the verb needs to be generated. Most of the verbs
do not need it.