scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Patryk Jędrzejczak	c9592a495e	Merge 'cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce' from Petr Gusev After an internal CAS shard bounce, check_locality() was evaluating against this_shard_id() of the post-bounce shard — which is the correct tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted the tablets-routing-v1 custom payload. The client never learned the correct tablet map. Fix by recording the original entry shard in client_state (initialized to this_shard_id() at construction, preserved across shard bounces via client_state_for_another_shard) and passing it to check_locality() so it compares against the client's actual routing decision. No host_id tracking or forwarded_client_state IDL changes are needed because CAS shard bounces are always intra-node. Fixes SCYLLADB-2041 backport: need to backport to all versions with LWT over tablets Closes scylladb/scylladb#29910 * https://github.com/scylladb/scylladb: cql: refactor add_tablet_info to take tablet_routing_info directly cql: fix UB dereference of nullopt tablet_info in execute_with_condition test/boost: add regression test for missing tablet routing after CAS bounce cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce	2026-05-18 11:19:04 +02:00
Yaniv Michael Kaul	34aac2030c	paxos: enable paging for internal paxos state queries The paxos state queries (load_paxos_state, save_paxos_promise, etc.) were using page_size=-1 (no paging). While each query returns at most one row and paging never actually kicks in, the lack of paging causes these internal queries to be counted as non-paged reads in the metrics, which can be confusing to users monitoring their cluster. Add LIMIT 1 to the SELECT query so that may_need_paging() short-circuits to false (row_limit <= 1), avoiding pager allocation overhead entirely. Set page_size=1000 so these queries are no longer reported as non-paged reads. Refs: https://scylladb.atlassian.net/browse/CUSTOMER-372 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Backport: no, improvement Closes scylladb/scylladb#29852	2026-05-18 11:35:55 +03:00
Aleksandra Martyniuk	d874d355c2	service: skip load_sketch unload for excluded nodes on RF shrink When an RF change shrinks replicas on a DC and the node being shrunk is excluded, refresh_tablet_load_stats() only provides load_stats for that node if it has a cached snapshot from when the node was still up. If the snapshot is missing or predates the tables being shrunk (e.g. they were created after the node went down), stats stay incomplete. In that case load_sketch::unload() called from make_rf_change_plan() throws: Can't provide accurate load computation with incomplete load_stats for host: <uuid> Since an excluded node is not expected to come back, load_stats will never become complete, and the topology coordinator retries the plan infinitely, hanging ALTER KEYSPACE. Add a check for excluded nodes and skip unload() for them: we are removing the replica, so accurate load data for that node is not needed. For all other node states the throw-and-retry behavior is preserved. Modify test_excludenode_shrink_rf to always trigger the bug: a new error injection 'force_down_node_load_stats_invalid' forces the invalid-stats path in refresh_tablet_load_stats() for a down node, so the test does not depend on whether the load-stats refresher happened to cache the excluded node's stats while it was still up. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702. Closes scylladb/scylladb#29622	2026-05-15 17:46:28 +02:00
Petr Gusev	167a3c9c50	cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce After an internal CAS shard bounce, check_locality() was evaluating against this_shard_id() of the post-bounce shard — which is the correct tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted the tablets-routing-v1 custom payload. The client never learned the correct tablet map. Fix by recording the original entry shard in client_state (initialized to this_shard_id() at construction, preserved across shard bounces via client_state_for_another_shard) and passing it to check_locality() so it compares against the client's actual routing decision. No host_id tracking or forwarded_client_state IDL changes are needed because CAS shard bounces are always intra-node. Fixes SCYLLADB-2041	2026-05-15 11:56:14 +02:00
Piotr Dulikowski	0c016cecc3	Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky service_levels: self-heal stale v1 marker after raft topology upgrade This PR handles an upgrade corner case where a node may already be using raft topology, while `system.scylla_local` still marks service levels as v1. The problem was introduced by commit `2917ec5d51` ("service:qos: service levels migration"), which added the service-levels migration from `system_distributed.service_levels` to `system.service_levels_v2` as part of the raft topology upgrade. However, if the cluster had no service levels configured, there was no data to migrate. In that case, the migration path could leave the local version marker unchanged, so the node would later observe an inconsistent state: * raft topology is already enabled; * service levels are still marked as v1 in `system.scylla_local`. Such clusters can be left in a stale state and fail startup during upgrade to 2026.2 This PR makes the upgrade path self-healing. The first commit restores `service_level_controller::migrate_to_v2()`, giving us a group0-based path for writing the service-levels v2 state even after raft topology is already in use. The second commit wires this path into startup. When the node detects the stale raft-topology + service-levels-v1 state, it retries the migration a bounded number of times and updates the version marker to v2 instead of failing startup. With this change, clusters that were left in this stale state can recover automatically during upgrade to 2026. Fixes: SCYLLADB-1807 backport: 2026.2 2026.1 we need this functionality when we are upgrading older servers Closes scylladb/scylladb#29749 * github.com:scylladb/scylladb: test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2. 
qos: self-heal stale service levels version on startup qos: reintroduce service levels v2 migration self-heal	2026-05-14 10:32:43 +02:00
Alex	6188bf3e01	test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2.	2026-05-13 17:55:20 +03:00
Alex	c2014f7e50	qos: self-heal stale service levels version on startup Add self_heal_service_levels_version() and use it during startup when the node is already on raft topology but service levels are still marked as v1. In that stale state, migrate service levels to v2 through group0 instead of failing startup.	2026-05-13 17:55:20 +03:00
Piotr Dulikowski	f3ac35f9d2	Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak When creating a strongly consistent table, wait for the table's raft servers to start and be ready to serve queries before completing the operation. We want the create table operation to absorb the delay of starting the raft groups instead of the first queries. The create table coordinator commits and applies the schema statement, then it waits for all hosts that have a tablet replica to create and start the raft groups for the table's tablets. It does this by sending an RPC to all the relevant hosts that executes a group0 barrier, in order to ensure the table and raft groups are created, then waits for all raft groups on the host to finish starting and be ready. Fixes SCYLLADB-807 no backport - strong consistency is still experimental Closes scylladb/scylladb#28843 * github.com:scylladb/scylladb: strong_consistency: wait for leader when starting a group strong_consistency: change wait for groups to start on startup strong_consistency: optimize wait_for_groups_to_start strong_consistency: wait for raft servers to start in create table	2026-05-13 16:42:05 +02:00
Piotr Dulikowski	3c2c814215	Merge 'db/view/view_building: replace system keyspace functions with mutation builder' from Michał Jadwiszczak `system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations. This PR removes all `make_..._mutation()` system keyspace functions related to view building tasks and replaces them with mutation builder. Refs https://github.com/scylladb/scylladb/issues/25929 This patch doesn't fix any bug, it only reduces number of generated mutations, no need to backport it. Closes scylladb/scylladb#26557 * github.com:scylladb/scylladb: db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder db/view/view_building_task_mutation_builder: make uuid generator optional db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder db/view/view_building_task_mutation_builder: add helper method	2026-05-13 16:10:55 +02:00
Tomasz Grabiec	66439bb753	Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili - Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible - Add a regression test that verifies the threshold is respected for intranode balancing The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards). The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path. Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations. The test creates a single node with 2 shards and 512 tablets: 1. Balanced scenario (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted 2. Unbalanced scenario (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted Fixes: SCYLLADB-1775 This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2 Closes scylladb/scylladb#29756 * github.com:scylladb/scylladb: test: add test for intranode balance threshold in size-based mode tablet_allocator: apply balance threshold to intranode shard balancing	2026-05-13 13:09:52 +02:00
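The threshold arithmetic quoted in the commit above can be checked with a small sketch. The function names and signatures here are hypothetical stand-ins, not ScyllaDB's actual `is_balanced()`; the choice to divide by the more loaded shard is an assumption inferred from the quoted figures (2/257 ≈ 0.78%, 102/307 ≈ 33%).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: relative load difference between two shards,
// measured against the more loaded one (assumption inferred from the
// percentages quoted in the commit message).
inline double relative_load_diff(std::uint64_t load_a, std::uint64_t load_b) {
    const std::uint64_t hi = std::max(load_a, load_b);
    const std::uint64_t lo = std::min(load_a, load_b);
    if (hi == 0) {
        return 0.0;  // two empty shards are trivially balanced
    }
    return static_cast<double>(hi - lo) / static_cast<double>(hi);
}

// Hypothetical stand-in for the balancer's threshold check: the shards
// count as balanced when the relative difference is within the threshold
// (e.g. 0.01 for a 1% threshold).
inline bool is_balanced(std::uint64_t load_a, std::uint64_t load_b, double threshold) {
    assert(threshold >= 0.0);
    return relative_load_diff(load_a, load_b) <= threshold;
}
```

With a 1% threshold, the 257 vs 255 case is treated as balanced and the 307 vs 205 case is not, matching the two scenarios the regression test describes.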
Patryk Jędrzejczak	3f2ff5a13f	Merge 'Remove raft_group0::finish_setup_after_join' from Gleb Natapov The function does nothing useful now. No backport needed. Removes code. Closes scylladb/scylladb#29828 * https://github.com/scylladb/scylladb: raft_group0: remove finish_setup_after_join function raft_group0: fix indentation after the last change raft_group: drop unneeded checks	2026-05-13 10:53:37 +02:00
Michał Jadwiszczak	1a32ccd8f6	db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder Again, get rid of the system keyspace method in favor of the mutation builder, because `system.view_building_tasks` is a single-partition table.	2026-05-13 10:06:18 +02:00
Alex	ac0a19aab8	qos: reintroduce service levels v2 migration self-heal migrate_to_v2() was removed after gossip-based service level migration support was dropped, since upgraded nodes were expected to already use service levels v2. However, clusters affected by the old migration bug may reach raft topology while system.scylla_local still has a stale service level version. Restore the migration helper so startup can self-heal those nodes by writing the v2 state through group0.	2026-05-13 10:16:02 +03:00
Michael Litvak	80bfc445a8	strong_consistency: wait for leader when starting a group When starting the raft server for a group, wait for the leader before completing the start operation. We want the group to be ready to accept writes by the time the start is reported to be completed without the additional latency of waiting for leader.	2026-05-13 08:43:26 +02:00
Michael Litvak	5f8322a820	strong_consistency: change wait for groups to start on startup On startup, groups_manager::start() was previously called and waited for the groups to start. We change it to just start the raft servers in the background without waiting for them to be fully started. We wait for the servers to start explicitly at a later stage of startup, after starting the messaging service. The reason is that, to be fully started, the servers may require communication that depends on the messaging service. This is not required currently, but it will be after the next commit.	2026-05-13 08:43:26 +02:00
Michael Litvak	e568ca2bd8	strong_consistency: optimize wait_for_groups_to_start Instead of iterating over all raft groups in wait_for_groups_to_start and checking whether we need to wait for each of them, maintain a list of only the raft groups that are starting and need to be waited for.	2026-05-13 08:43:26 +02:00
Michael Litvak	5a5c7c6241	strong_consistency: wait for raft servers to start in create table When creating a strongly consistent table, wait for the table's raft servers to start and be ready to serve queries before completing the operation. We want the create table operation to absorb the delay of starting the raft groups instead of the first queries. The create table coordinator commits and applies the schema statement, then it waits for all hosts that have a tablet replica to create and start the raft groups for the table's tablets. It does this by sending an RPC to all the relevant hosts that executes a group0 barrier, in order to ensure the table and raft groups are created, then waits for all raft groups on the host to finish starting and be ready. Fixes SCYLLADB-807	2026-05-13 08:43:24 +02:00
Michał Jadwiszczak	e002665aa7	db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder `system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.	2026-05-12 21:49:18 +02:00
Piotr Dulikowski	129f193116	Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes. The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard. Fixes SCYLLADB-1343 Strong consistency is still in experimental phase, no need to backport. Closes scylladb/scylladb#29318 * github.com:scylladb/scylladb: test/strong_consistency: verify metrics strong_consistency: wire up metrics to operations strong_consistency: add stats struct and metrics registration	2026-05-12 16:15:51 +02:00
Botond Dénes	e95eb21a16	Merge 'Tablet-aware restore' from Pavel Emelyanov The mechanics of the restore is like this - A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet - The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas - Each replica handles the RPC verb by - Reading the snapshot_sstables table - Filtering the read sstable infos against current node and tablet being handled - Downloading and attaching the filtered sstables This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code. This is first step towards SCYLLADB-197 and lacks many things. 
In particular - the API only works for single-DC cluster - the caller needs to "lock" tablet boundaries with min/max tablet count - not abortable - no progress tracking - sub-optimal (re-kicking API on restore will re-download everything again) - not re-attacheable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node) - nodes download sstables in maintenance/streaming sched gorup (should be moved to maintenance/backup) Other follow-up items: - have an actual swagger object specification for `backup_location` Closes #28436 Closes #28657 Closes #28773 Closes scylladb/scylladb#28763 * github.com:scylladb/scylladb: docs: Update topology_over_raft.md with `restore` transition kind test: Add test for backup vs migration race test: Restore resilience test sstables_loader: Fail tablet-restore task if not all sstables were downloaded sstables_loader: mark sstables as downloaded after attaching sstables_loader: return shared_sstable from attach_sstable db: add update_sstable_download_status method db: add downloaded column to snapshot_sstables db: extract snapshot_sstables TTL into class constant test: Add a test for tablet-aware restore tablets: Implement tablet-aware cluster-wide restore messaging: Add RESTORE_TABLET RPC verb sstables_loader: Add method to download and attach sstables for a tablet tablets: Add restore_config to tablet_transition_info sstables_loader: Add restore_tablets task skeleton test: Add rest_client helper to kick newly introduced API endpoint api: Add /storage_service/tablets/restore endpoint skeleton sstables_loader: Add keyspace and table arguments to manfiest loading helper sstables_loader_helpers: just reformat the code sstables_loader_helpers: generalize argument and variable names sstables_loader_helpers: generalize get_sstables_for_tablet sstables_loader_helpers: add token getters for tablet filtering sstables_loader_helpers: remove underscores from struct members sstables_loader: move 
download_sstable and get_sstables_for_tablet sstables_loader: extract single-tablet SST filtering sstables_loader: make download_sstable static sstables_loader: fix formating of the new `download_sstable` function sstables_loader: extract single SST download into a function sstables_loader: add shard_id to minimal_sst_info sstables_loader: add function for parsing backup manifests split utility functions for creating test data from database_test export make_storage_options_config from lib/test_services rjson: Add helpers for conversions to dht::token and sstable_id Add system_distributed_keyspace.snapshot_sstables add get_system_distributed_keyspace to cql_test_env code: Add system_distributed_keyspace dependency to sstables_loader storage_service: Export export handle_raft_rpc() helper storage_service: Export do_tablet_operation() storage_service: Split transit_tablet() into two tablets: Add braces around tablet_transition_kind::repair switch	2026-05-12 16:24:13 +03:00
Avi Kivity	ddb1181103	Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes. The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan. Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced. Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing. This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2 Fixes: SCYLLADB-1803 Closes scylladb/scylladb#29791 * github.com:scylladb/scylladb: test: boost: add drain test for forced capacity-based balancing service: allow draining with forced capacity-based balancing	2026-05-12 12:38:25 +03:00
Piotr Dulikowski	7c2b1ea0b5	Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak `system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters. Two-part fix: 1. Range tombstones instead of row tombstones (commits 2–3) Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction. 2. Bounded scan with `min_task_id` (commits 4–6) Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all. - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`. - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch). - On reload, read `min_task_id` first using a static-only partition slice (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted. - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows. The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan. The issue is not critical, so the fix shouldn't be backported. 
Fixes SCYLLADB-657 Closes scylladb/scylladb#28929 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning docs: document tombstone avoidance in view_building_tasks view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` view_building: introduce `task_uuid_generator` view_building: store `min_alive_uuid` in view building state view_building: set min_task_id when GC-ing finished tasks view_building: add min_task_id support to view_building_task_mutation_builder view_building: add min_task_id static column and bounded scan to system_keyspace view_building: use range tombstone when GC-ing finished tasks view_building: add range tombstone support to view_building_task_mutation_builder view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature	2026-05-12 12:38:25 +03:00
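The garbage-collection boundary described in the commit above can be illustrated with a minimal sketch. Everything here is hypothetical: `task`, `min_alive_id` and `covered_by_range_tombstone` are illustrative names, and a plain integer stands in for the timeuuid clustering key. The sketch finds the smallest id among unfinished tasks and counts the rows that a single range tombstone `[before_all, min_alive)` would cover.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical task record; `id` stands in for the timeuuid clustering key.
struct task {
    std::uint64_t id;
    bool finished;
};

// Smallest id among unfinished ("alive") tasks, or nullopt if none is alive.
// What the real code does when no task is alive is not stated in the log.
inline std::optional<std::uint64_t> min_alive_id(const std::vector<task>& tasks) {
    std::optional<std::uint64_t> result;
    for (const auto& t : tasks) {
        if (!t.finished && (!result || t.id < *result)) {
            result = t.id;
        }
    }
    return result;
}

// Number of rows covered by one range tombstone [before_all, min_alive).
// Every task below the minimum alive id is finished by definition.
inline std::size_t covered_by_range_tombstone(const std::vector<task>& tasks,
                                              std::uint64_t min_alive) {
    std::size_t covered = 0;
    for (const auto& t : tasks) {
        if (t.id < min_alive) {
            ++covered;
        }
    }
    return covered;
}
```

Finished tasks with ids above the boundary are not covered by that range in this sketch. The same boundary value is what a bounded scan of the form `id >= min_task_id` would use as its lower limit.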
Ferenc Szili	aaead10e5d	tablet_allocator: apply balance threshold to intranode shard balancing The intranode shard balancing loop only stopped when the most-loaded and least-loaded shard were the same (src == dst), meaning it would keep issuing migrations until the load difference reached exactly 0. This caused unnecessary migrations for negligible imbalances. Apply the same is_balanced() threshold check that is already used for inter-node balancing, so that intranode migrations stop when the relative load difference between shards is within the configured size_based_balance_threshold (default 1%).	2026-05-12 10:34:16 +02:00
Pavel Emelyanov	17384d42e3	tablets: Implement tablet-aware cluster-wide restore This patch adds - Changes in sstables_loader::restore_tablets() method It populates the system_distributed_keyspace.snapshot_sstables table with the information read from the manifest - Implementation of tablet_restore_task_impl::run() method It emplaces a bunch of tablet migrations with "restore" kind - Topology coordinator handling of tablet_transition_stage::restore When seen, the coordinator calls RESTORE_TABLET RPC against all tablet replicas Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Pavel Emelyanov	2c60d8f897	storage_service: Export export handle_raft_rpc() helper Just like do_tablet_operation, this one will be used by sstables_loader restore-tablet RPC Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:17:40 +03:00
Pavel Emelyanov	1c0e04316b	storage_service: Export do_tablet_operation() Next patches will introduce an RPC handler to restore a tablet on replica. The handler will be registered by sstables_loader, and it will have to call that helper from storage_service which thus needs to be moved to public scope. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:17:40 +03:00
Pavel Emelyanov	e5f04b0927	storage_service: Split transit_tablet() into two The goal of the split is to have try_transit_tablet() that - doesn't throw if tablet is in transition, but reports it back - doesn't wait for the submitted transition to finish The user will be in tablet-aware-restore, it will call this new trying helper in parallel, then wait for all transitions to finish. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:17:39 +03:00
Calle Wilund	2cc1a2c406	storage_service: Disable snapshots after raft decommission Fixes: SCYLLADB-1693 In case we abort a decommission operation, the snapshot/backup mechanism needs to remain open. This change moves the disabling to after raft_decommission. In the case of a cluster snapshot, our node's ownership (or not) of tables will be serialized by raft anyway, so it should remain consistent. In that case we at worst coordinate from a node in "leave" status. In the case of a local snapshot, ownership matters less, only sstables on disk, which should not change. In the case of backup, this operates on a snapshot, the state of which is not affected. Adds an injection point for testing. v2: - Added injection point to ensure test can abort decommission Closes scylladb/scylladb#29667	2026-05-11 17:04:09 +03:00
Yaniv Michael Kaul	3cba27d25f	topology: propagate error messages through raft_topology_cmd_result When a topology command (e.g., rebuild) fails on a target node, the exception message was being swallowed at multiple levels: 1. raft_topology_cmd_handler caught exceptions and returned a bare fail status with no error details. 2. exec_direct_command_helper saw the fail status and threw a generic "failed status returned from {id}" message. 3. The rebuilding handler caught that and stored a hardcoded "streaming failed" message. This meant users only saw "rebuild failed: streaming failed" instead of the actionable error from the safety check (e.g., "it is unsafe to use source_dc=dc2 to rebuild keyspace=..."). Fix by: - Adding an error_message field to raft_topology_cmd_result (with [[version 2026.2]] for wire compatibility). - Populating error_message with the exception text in the handler's catch blocks. - Including error_message in the exception thrown by exec_direct_command_helper. - Passing the actual error through to rtbuilder.done() instead of the hardcoded "streaming failed". A follow-up test is in https://github.com/scylladb/scylladb/pull/29363 Fixes: SCYLLADB-1404 Closes scylladb/scylladb#29362	2026-05-11 17:01:15 +03:00
Calle Wilund	db1b92c185	service::load_balancer: Add metrics for repair and rebuild count Fixes #21115 Adds cluster counter for repairs, and dc counter for rebuilds Closes scylladb/scylladb#28985	2026-05-11 16:57:46 +03:00
Gleb Natapov	c3d2f0bde9	raft_group0: remove finish_setup_after_join function The only thing it does now is change a bootstrapping node to become a voter in case the cluster does not support the limited voters feature. But the feature was introduced in 2025.2, and direct upgrade from 2025.1 to a version newer than 2026.1 is not supported. Even if such an upgrade is done, the removed code has an effect only during bootstrap, not during regular boot. Also remove the upgrade test, since after the patch suppressing the feature on the first boot will no longer behave correctly.	2026-05-11 15:38:36 +03:00
Gleb Natapov	5213aee99f	raft_group0: fix indentation after the last change	2026-05-11 11:56:26 +03:00
Gleb Natapov	5f7f72fa50	raft_group: drop unneeded checks	2026-05-11 11:55:39 +03:00
Botond Dénes	eae15f4fdd	Merge 'Share timeout_config between services' from Pavel Emelyanov The timeout_config (more exactly -- updateable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Also some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database. This PR introduces top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy use it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is not duplicated by transport and alternator controllers. Components' dependencies cleanup, not backporting. Closes scylladb/scylladb#29636 * github.com:scylladb/scylladb: storage_proxy: Use shared updateable_timeout_config for CAS contention timeout alternator: Use shared updateable_timeout_config by reference cql_transport: Use shared updateable_timeout_config by reference storage_proxy: Use shared updateable_timeout_config by reference main: Introduce sharded<updateable_timeout_config> storage_proxy: Keep own updateable_timeout_config	2026-05-11 11:12:01 +03:00
Botond Dénes	9b2dfab2e5	Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto the view_update_backlog top-level service, makes the calculating helper a method of that class and patches the callers to use it. This eliminates more places that abuse database as a db::config accessor. Code dependencies refactoring, not backporting Closes scylladb/scylladb#29635 * github.com:scylladb/scylladb: view: Turn calculate_view_update_throttling_delay into node_update_backlog member view: Place view_flow_control_delay_limit_in_ms on node_update_backlog view: Add node_update_backlog reference to view_update_generator	2026-05-11 10:30:24 +03:00
Pavel Emelyanov	f39cbb1ec6	storage_proxy: Move maintenance_mode onto storage_proxy::config Stop reading maintenance_mode through replica::database's db::config. Add a properly typed maintenance_mode_enabled field to storage_proxy::config, populate it in main.cc from cfg->maintenance_mode() (same as messaging_service::config), and use a cached member in storage_proxy instead of db.local().get_config().maintenance_mode(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29637	2026-05-11 10:11:20 +03:00
Botond Dénes	3f72852d8c	Merge 'Fix missing format string placeholders across the codebase (33 bugs across 14 modules )' from Yaniv Kaul Fix 28 format string bugs plus 5 related format argument bugs across 14 modules where `{}` placeholders were missing or arguments were wrong, causing arguments to be silently dropped or misleading output from the `{fmt}` library. Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single instance in `replica/table.cc`), a comprehensive audit of the entire codebase was performed to find all similar issues. - Missing `{}` placeholder (21 instances): format string simply lacks `{}` for a passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id` is silently dropped - Spurious comma breaking C++ string literal concatenation (2 instances): a comma after a string literal prevents adjacent-literal concatenation, turning the continuation into a format argument instead of part of the format string - Printf-style `%s` in fmtlib context (4 instances): `%s` has no meaning in fmtlib and appears as literal text while the argument is silently ignored - Extra spurious argument (1 instance): an extraneous `t.tomb()` argument inserted between correct arguments, causing wrong values in the wrong slots - Wrong variable in error message (4 instances in `types/map.hh`): error messages for oversized map keys/values reported `map_size` (total entry count) instead of the actual `elem.first.size()` or `elem.second.size()` that exceeded the limit - Swapped argument order (1 instance in `data_dictionary/data_dictionary.cc`): format string says `"Extraneous options for {type}: {values}"` but the values and type arguments were passed in reverse order \| Module \| Bugs Fixed \| Files \| \|--------\|:---------:\|-------\| \| `replica/` \| 1 \| `table.cc` \| \| `service/` \| 4 \| `raft_group0.cc`, `storage_service.cc` \| \| `db/` \| 6 \| `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, 
`view_building_worker.cc`, `row_locking.cc` \| \| `cql3/` \| 2 \| `prepare_expr.cc`, `statement_restrictions.cc` \| \| `transport/` \| 4 \| `event_notifier.cc` \| \| `sstables/` \| 3 \| `partition_reversing_data_source.cc`, `reader.cc` \| \| `alternator/` \| 1 \| `conditions.cc` \| \| `cdc/` \| 1 \| `split.cc` \| \| `raft/` \| 1 \| `server.cc` \| \| `utils/` \| 2 \| `gcp/object_storage.cc`, `s3/client.cc` \| \| `mutation/` \| 1 \| `mutation_partition.hh` \| \| `ent/` \| 2 \| `kmip_host.cc`, `kms_host.cc` \| \| `types/` \| 4 \| `map.hh` \| \| `data_dictionary/` \| 1 \| `data_dictionary.cc` \| The `{fmt}` library's compile-time checker validates that each `{}` placeholder references a valid argument, but does not verify the reverse -- that every argument has a corresponding placeholder. Extra arguments are silently ignored at both compile time and runtime. Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly. --- Note: Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul". 
Closes scylladb/scylladb#29448 * github.com:scylladb/scylladb: data_dictionary: fix swapped arguments in extraneous options error types: fix wrong variable in map key/value size error messages ent: fix missing format placeholders in encryption error/log messages mutation: fix spurious argument in shadowable_tombstone formatter utils: fix missing format placeholders in object storage log messages raft: fix missing format placeholder in server ostream operator cdc: fix missing format placeholder in error message alternator: fix missing format placeholder in error message sstables: fix missing format placeholders in error messages transport: fix printf-style format specifiers in fmtlib log calls cql3: fix missing format placeholders in error messages db: fix missing format placeholders in log and error messages service: fix missing format placeholders in log messages replica: fix missing format placeholder in cleanup log message	2026-05-11 07:04:42 +03:00
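The two most common bug classes in the merge above (a missing `{}` placeholder, and a spurious comma that stops adjacent string literals from concatenating) can be illustrated with a small counting check. This is only a sketch under stated assumptions: `count_placeholders` and `dropped_args` are hypothetical helpers written for illustration, they are not part of `{fmt}` or ScyllaDB, and the scan handles only plain `{}` fields (no `{{` escapes, indexed or named fields).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: count plain "{}" placeholders in a format string.
// Simplification: "{{" escapes and indexed/named fields are not handled.
constexpr std::size_t count_placeholders(std::string_view fmt) {
    std::size_t n = 0;
    for (std::size_t i = 0; i + 1 < fmt.size(); ++i) {
        if (fmt[i] == '{' && fmt[i + 1] == '}') {
            ++n;
            ++i; // skip the closing brace that was just consumed
        }
    }
    return n;
}

// Hypothetical audit check: how many trailing arguments have no placeholder
// and would therefore be silently dropped by the formatter.
template <typename... Args>
std::size_t dropped_args(std::string_view fmt, const Args&...) {
    const std::size_t have = count_placeholders(fmt);
    return sizeof...(Args) > have ? sizeof...(Args) - have : 0;
}
```

With these helpers, `dropped_args("msg for table {}", group_id, table_id)` reports one dropped argument. A call such as `dropped_args("node {} rejected, ", "host {}", id)` also reports one, because the comma after the first literal turns the continuation string into an argument instead of part of the format string.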
Nadav Har'El	d4aa528834	Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili - Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error` - The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort. `get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables: - `make_sizing_plan`: skips to next table - `make_resize_plan`: skips to next table (merge suppression is moot) - `check_constraints`: returns `skip_info{}` with empty viable targets - `get_rs`: returns `nullptr`, checked by `check_constraints` The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it. Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot. Fixes: SCYLLADB-1664 This fix needs to be backported to versions: 2025.4, 2026.1 Closes scylladb/scylladb#29585 * github.com:scylladb/scylladb: test: verify load balancer handles dropped tables gracefully tablet_allocator: handle dropped tables gracefully in get_schema_and_rs	2026-05-10 22:07:51 +03:00
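The pattern described in the merge above (look up each snapshot table in the live schema, and skip it rather than abort when it has been dropped) might be sketched as follows. All names here (`table_info`, `live_schema`, `find_table`, `total_tablets`) are hypothetical stand-ins chosen for illustration; they are not the real `get_schema_and_rs()` signature or the load balancer's data structures.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for per-table metadata.
struct table_info {
    std::string name;
    int tablet_count;
};

// Hypothetical stand-in for the live schema, keyed by table id.
using live_schema = std::map<int, table_info>;

// Return std::nullopt for a missing table instead of raising an internal error.
std::optional<table_info> find_table(const live_schema& live, int table_id) {
    auto it = live.find(table_id);
    if (it == live.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Walk the table ids captured in a (possibly stale) snapshot and skip any
// table that was dropped from the live schema in the meantime.
int total_tablets(const std::vector<int>& snapshot_ids, const live_schema& live) {
    int total = 0;
    for (int id : snapshot_ids) {
        auto table = find_table(live, id);
        if (!table) {
            continue; // dropped after the snapshot was taken
        }
        total += table->tablet_count;
    }
    return total;
}
```

The design point is that the caller decides what "missing" means: here every caller skips, which mirrors the list of callers in the commit message.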
Yaniv Kaul	4ee81f9b32	service: fix missing format placeholders in log messages Fix four format string bugs: - raft_group0.cc: the exception from sleep_and_abort was passed as an argument but had no {} placeholder, so it was silently dropped. - storage_service.cc: loading topology trace was missing a placeholder for the cleanup field (9 args but only 8 placeholders). - storage_service.cc: two join-rejection warnings had a spurious comma after the first string literal, breaking C++ string concatenation. This caused the continuation string to be treated as a separate format argument instead of being part of the format string, and params.host_id was silently dropped. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2026-05-10 17:49:50 +03:00
Avi Kivity	5a887362e3	Merge 'Remove legacy tables creation code' from Gleb Natapov Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. Also the series contains a small improvement around group0 creation. No backport needed since this removes functionality. Closes scylladb/scylladb#29482 * github.com:scylladb/scylladb: db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2 db/system_distributed_keyspace: remove unused code db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table db/system_distributed_keyspace: drop old service_levels table fix indent after the previous patch group0: call setup_group0 only when needed	2026-05-10 14:46:21 +03:00
Ferenc Szili	906d2b817e	service: allow draining with forced capacity-based balancing When force_capacity_based_balancing is enabled, the tablet allocator balances by node and shard capacity rather than by tablet sizes. When the data needed for load balancing is incomplete, the balancer fails and waits until load_stats is available and correct for all the nodes. An exception to this is when a node is being drained and excluded: it is unreachable, and will not return. In this case the balancer has to do its best and ignore the missing data. This patch fixes a bug where forcing capacity based balancing made the balancer not ignore missing data in these cases, and instead abort the balancing.	2026-05-07 13:44:53 +02:00
Patryk Jędrzejczak	25fd1001c2	Merge 'alternator: improve CreateTable/UpdateTable schema agreement timeout' from Nadav Har'El CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard-coded 10 second timeout, and on some overloaded test machines we saw it not completing in time, and causing tests to become flaky. This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long. The patch also improves the behavior of a schema-agreement timeout, when it happens: 1. Provide an InternalServerError with more descriptive text. 2. This InternalServerError tells the user that the result of the operation is unknown; So the user will repeat the CreateTable, and will get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement. So we added this missing wait. Fixes SCYLLADB-1804 Refs #5052 (claiming CreateTable shouldn't wait at all) This patch is only important to improve test stability on extremely slow test machines where schema agreement sometimes (very rarely) takes over 10 seconds. It's not important to backport it to branches that don't run CI very often on slow machines. Closes scylladb/scylladb#29744 * https://github.com/scylladb/scylladb: alternator: improve CreateTable/UpdateTable schema agreement timeout migration_manager: unique timeout exception for wait_for_schema_agreement()	2026-05-06 16:56:46 +02:00
Patryk Jędrzejczak	b69d00b0a7	Merge 'Barrier and drain logging' from Gleb Natapov Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281 Backport since we want to have it if it happens in the field. Fixes: SCYLLADB-1821 Refs: #26281 Closes scylladb/scylladb#29735 * https://github.com/scylladb/scylladb: session, raft_topology: add periodic warnings for hung drain and stale version waits session: add info-level logging to drain_closing_sessions raft_topology: log sub-step progress in local_topology_barrier raft_topology: log read_barrier progress in topology cmd handler	2026-05-05 15:04:50 +02:00
Nadav Har'El	5895dff03b	migration_manager: unique timeout exception for wait_for_schema_agreement() Before this patch, if wait_for_schema_agreement() timed out, it threw a generic std::runtime_error, making it inconvenient for callers to catch this error only. So in this patch we create and use a new exception type, schema_agreement_timeout, based on seastar::timed_out_error. Although wait_for_schema_agreement(), added in commit `a429018a8a`, was a utility function used in a dozen places, it has become less interesting after we introduced schema changes over Raft, and over the years most of the callers to this function were removed, except one in view.cc which uses an infinite timeout, so doesn't care about the timeout exception type. In the next patch we want to add a new caller which does care about the timeout exception type - hence this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 10:38:38 +03:00
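The benefit of a dedicated exception type, as described in the commit above, is that a caller can catch the schema-agreement timeout without also swallowing unrelated errors. A minimal sketch follows, assuming a local `timed_out_error` struct as a stand-in for `seastar::timed_out_error`; the `outcome` enum and `run` helper are hypothetical and exist only to make the catch order observable.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Stand-in for seastar::timed_out_error (assumption: the real type is not used
// here so that the sketch stays stdlib-only).
struct timed_out_error : std::exception {
    const char* what() const noexcept override { return "timedout"; }
};

// Dedicated type so callers can tell this timeout apart from other failures.
struct schema_agreement_timeout : timed_out_error {
    const char* what() const noexcept override {
        return "timed out waiting for schema agreement";
    }
};

// Hypothetical result classification used only by this sketch.
enum class outcome { ok, agreement_timeout, other_error };

// Run an operation and classify how it ended. The most specific handler
// comes first, so only the dedicated timeout takes the "unknown result" path.
template <typename Op>
outcome run(Op&& op) {
    try {
        op();
        return outcome::ok;
    } catch (const schema_agreement_timeout&) {
        return outcome::agreement_timeout;
    } catch (const std::exception&) {
        return outcome::other_error;
    }
}
```

Before the patch, both failure kinds would have arrived as `std::runtime_error`, so a caller could only distinguish them by inspecting the message text.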
Gleb Natapov	d2b695aa64	session, raft_topology: add periodic warnings for hung drain and stale version waits Add periodic warning timers (every 5 minutes) to help diagnose hangs in barrier_and_drain: - drain_closing_sessions(): warn if semaphore acquisition or session gate close is taking too long, reporting the gate count to show how many guards are still alive. - local_topology_barrier(): warn if stale_versions_in_use() is taking too long, reporting the current stale version trackers. - session::gate_count(): new public accessor for diagnostic purposes. These warnings help distinguish between the two possible hang points in barrier_and_drain (stale versions vs session drain) and provide ongoing visibility into what's blocking progress.	2026-05-04 15:58:45 +03:00
Gleb Natapov	385915c101	session: add info-level logging to drain_closing_sessions drain_closing_sessions() is called as part of the barrier_and_drain topology command and can block on two things: acquiring the drain semaphore (if another drain is in progress) and waiting for individual sessions to close (which blocks until all session guards are released). Previously, all logging in this function was at debug level, making it invisible in production logs. When barrier_and_drain hangs, there is no way to tell whether the function is waiting for the semaphore, waiting for a specific session to close, or was never called. Promote logging to info level and add messages at each blocking point: before/after semaphore acquisition (with count of sessions to drain), before/after each individual session close (with session id), and at function completion. This makes it possible to identify the exact session blocking a topology operation from the node log alone.	2026-05-04 15:58:45 +03:00
Gleb Natapov	e88ce09372	raft_topology: log sub-step progress in local_topology_barrier When a node processes a barrier_and_drain topology command, it performs two potentially long-running operations inside local_topology_barrier(): waiting for stale token metadata versions to be released (stale_versions_in_use) and draining closing sessions (drain_closing_sessions). Either of these can hang indefinitely -- for example, stale_versions_in_use blocks until all references to previous token metadata versions are released, which depends on in-flight requests completing. Previously, the only logging was a single 'done' message at the end, making it impossible to determine which sub-step was blocking when a barrier_and_drain RPC appeared stuck on a node. In a recent CI failure, a node never responded to barrier_and_drain during a removenode operation, and the logs showed the RPC was received but nothing about what it was waiting on internally. Add info-level logging before each blocking sub-step, including the topology version for correlation. This allows diagnosing hangs by showing whether the node is stuck waiting for stale metadata versions, stuck draining sessions, or never reached these steps at all.	2026-05-04 15:58:45 +03:00
Yaniv Michael Kaul	6179406467	raft/group0: fix destroy assertion on startup failure If start_server_for_group0() successfully registers a server in _raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine()) throws, the server is never destroyed because abort_and_drain()/destroy() check std::get_if<raft::group_id>(&_group0) which was only set after the entire with_scheduling_group block completed. Move _group0.emplace<raft::group_id>() inside the lambda, immediately after start_server_for_group() succeeds, so that cleanup paths can always find and destroy the registered server. This fixes the assertion: "raft_group_registry - stop(): server for group ... is not destroyed" which manifests during shutdown after an upgrade where topology_state_load() fails due to netw::unknown_address. Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades Refs: SCYLLADB-1217 Refs: CUSTOMER-340 Refs: CUSTOMER-335 Fixes: SCYLLADB-1801 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: Yes, Opencode/Opus 4.6 Closes scylladb/scylladb#29702	2026-05-04 11:25:46 +02:00
Gleb Natapov	11b838e71e	raft_topology: log read_barrier progress in topology cmd handler When a raft topology command (e.g. barrier_and_drain) is received by a node, the handler first performs a raft read_barrier to ensure it sees the latest topology state. This read_barrier can hang indefinitely if raft cannot achieve quorum, but there was no logging around it, making it impossible to tell whether the handler was stuck at this step or somewhere else. Add info-level logging before and after the read_barrier call in raft_topology_cmd_handler, including the command type, index, and term. This allows diagnosing hangs by showing whether the node entered the read_barrier and whether it completed, narrowing down the root cause when a topology command RPC appears stuck on the receiver side.	2026-05-03 13:56:25 +03:00
Patryk Jędrzejczak	15f35577ed	Merge 'paxos_state: keep prepared message alive across statement execution' from Petr Gusev In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute(). Since `3ac4e258e8` ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable. The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge - triggered, for example, by Raft schema changes received during a node restart - the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with checked_ptr_is_null_exception. 
The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as: WriteFailure: Failed to prepare ballot ... Replica errors: host_id ... -> seastar::rpc::remote_verb_error (std::exception) Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle. Fixes SCYLLADB-1173 backport: the patch is simple, we can backport it to all versions with "LWT over tablets" feature. Note that the problem is only in test runs in debug configuration, production is not affected. Closes scylladb/scylladb#29675 * https://github.com/scylladb/scylladb: table_helper: retry insert prepare on concurrent cache invalidation paxos_state: keep prepared message alive across statement execution	2026-04-29 17:57:27 +02:00
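The lifetime gap fixed by the merge above can be modelled with standard smart pointers. In this sketch `std::shared_ptr` plays the role of the pinned `result_message::prepared` and `std::weak_ptr` plays the role of the `checked_weak_ptr`; the `cache` struct and both functions are hypothetical, and the explicit `purge()` call stands in for a concurrent cache invalidation running at a preemption point.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct statement {
    std::string text;
};

// Hypothetical stand-in for the prepared-statement cache.
struct cache {
    std::shared_ptr<statement> entry;

    // Returns an owning handle, so the caller pins the entry while holding it.
    std::shared_ptr<statement> prepare(const std::string& query) {
        entry = std::make_shared<statement>(statement{query});
        return entry;
    }

    // Models cache invalidation, for example after a schema change.
    void purge() { entry.reset(); }
};

// Pre-fix shape: the owning handle dies at the end of the inner block, so a
// purge before execution leaves only a dangling weak handle.
bool usable_without_pin(cache& c) {
    std::weak_ptr<statement> weak;
    {
        auto pinned = c.prepare("SELECT 1");
        weak = pinned;
    } // pin released here
    c.purge();
    return !weak.expired();
}

// Post-fix shape: the owning handle is hoisted to the outer scope and keeps
// the entry alive across the point where the purge can run.
bool usable_with_pin(cache& c) {
    std::shared_ptr<statement> pinned;
    std::weak_ptr<statement> weak;
    {
        pinned = c.prepare("SELECT 1");
        weak = pinned;
    }
    c.purge();
    return !weak.expired();
}
```

In the real code the failure surfaced as `checked_ptr_is_null_exception` when promoting the weak pointer; here the same condition shows up as `weak.expired()` returning true.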


6342 Commits