scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Petr Gusev	9e3209e4a3	cql: refactor add_tablet_info to take tablet_routing_info directly Change add_tablet_info() to accept locator::tablet_routing_info instead of destructured (tablet_replica_set, token_range) pair. This simplifies all three call sites. Remove the empty-replicas guard inside add_tablet_info(): the only producer of tablet_routing_info is tablet ERM's check_locality(), which returns either nullopt (correctly routed) or info with replicas copied from tablet_info — a tablet always has replicas. All callers already check for nullopt before calling add_tablet_info(), so by the time we enter the function replicas are guaranteed non-empty.	2026-05-15 12:28:33 +02:00
Petr Gusev	738b7b4a86	cql: fix UB dereference of nullopt tablet_info in execute_with_condition When check_locality() returns nullopt (correctly routed LWT), the optional tablet_info was unconditionally dereferenced in the lambda capture list: tablet_info->tablet_replicas, tablet_info->token_range. The code previously masked this by initializing tablet_info with an empty-but-present value, so the dereference happened to work but only because the empty tablet_replicas made add_tablet_info() a no-op. After check_locality() overwrites it with nullopt, the dereference is UB. Fix by initializing tablet_info as empty (nullopt) and guarding the dereference.	2026-05-15 11:56:14 +02:00
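The guard described in the commit above can be sketched as follows. This is a minimal illustration, not the ScyllaDB code: `routing_info`, `add_tablet_info` and `build_payload` are simplified, hypothetical stand-ins whose names loosely follow the commit message.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for locator::tablet_routing_info; field names follow
// the commit message, the types are simplified for illustration.
struct routing_info {
    std::vector<int> tablet_replicas;
    std::string token_range;
};

// Hypothetical stand-in for add_tablet_info(). A tablet always has replicas,
// so an empty replica set here would indicate a caller bug.
void add_tablet_info(std::vector<std::string>& payload, const routing_info& info) {
    assert(!info.tablet_replicas.empty());
    payload.push_back("tablets-routing-v1:" + info.token_range);
}

// The optional is dereferenced only after checking that it holds a value.
// Writing info->tablet_replicas on an empty optional is undefined behavior.
std::vector<std::string> build_payload(const std::optional<routing_info>& info) {
    std::vector<std::string> payload;
    if (info) {
        add_tablet_info(payload, *info);
    }
    return payload;
}
```

With an empty optional (the correctly routed case) no payload entry is added; with a populated one, exactly one entry is attached.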
Petr Gusev	167a3c9c50	cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce After an internal CAS shard bounce, check_locality() was evaluating against this_shard_id() of the post-bounce shard — which is the correct tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted the tablets-routing-v1 custom payload. The client never learned the correct tablet map. Fix by recording the original entry shard in client_state (initialized to this_shard_id() at construction, preserved across shard bounces via client_state_for_another_shard) and passing it to check_locality() so it compares against the client's actual routing decision. No host_id tracking or forwarded_client_state IDL changes are needed because CAS shard bounces are always intra-node. Fixes SCYLLADB-2041	2026-05-15 11:56:14 +02:00
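The effect of comparing against the entry shard rather than the post-bounce shard can be sketched as below. All types and functions here (`shard_id`, `routing_hint`, `client_state_sketch`, this simplified `check_locality`) are hypothetical simplifications, not the real signatures.

```cpp
#include <optional>

using shard_id = unsigned;

// Hypothetical hint returned to the client when it contacted the wrong shard.
struct routing_hint {
    shard_id correct_shard;
};

// Returns a hint only when the shard the client originally contacted differs
// from the shard that owns the tablet; nullopt means "correctly routed".
std::optional<routing_hint> check_locality(shard_id entry_shard, shard_id tablet_shard) {
    if (entry_shard == tablet_shard) {
        return std::nullopt;
    }
    return routing_hint{tablet_shard};
}

// Hypothetical client state: the entry shard is recorded once and preserved
// when the state is copied for execution on another shard.
struct client_state_sketch {
    shard_id entry_shard;
    client_state_sketch for_another_shard() const { return *this; }
};
```

If a request enters on shard 0 and bounces to the owning shard 2, passing the current shard (2) yields no hint, while passing the preserved entry shard (0) yields a hint pointing at shard 2.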
Piotr Dulikowski	f3ac35f9d2	Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak When creating a strongly consistent table, wait for the table's raft servers to start and be ready to serve queries before completing the operation. We want the create table operation to absorb the delay of starting the raft groups instead of the first queries. The create table coordinator commits and applies the schema statement, then it waits for all hosts that have a tablet replica to create and start the raft groups for the table's tablets. It does this by sending an RPC to all the relevant hosts that executes a group0 barrier, in order to ensure the table and raft groups are created, then waits for all raft groups on the host to finish starting and be ready. Fixes SCYLLADB-807 no backport - strong consistency is still experimental Closes scylladb/scylladb#28843 * github.com:scylladb/scylladb: strong_consistency: wait for leader when starting a group strong_consistency: change wait for groups to start on startup strong_consistency: optimize wait_for_groups_to_start strong_consistency: wait for raft servers to start in create table	2026-05-13 16:42:05 +02:00
Piotr Dulikowski	dc05bd35bb	Merge 'strong_consistency: limit available consistency levels in strong consistent requests' from Michał Jadwiszczak Strong consistent requests take a different path than EC requests, and consistency levels don’t map well. We should limit the available consistency levels in SC requests to avoid silently ignoring them, which may confuse users. For writes, there is only one option: - QUORUM/LOCAL_QUORUM (multi DC is not supported yet, so both of those CLs have the same effect) - we need a quorum of replicas to successfully commit new mutations to the Raft log. For reads, there are 2 options: - QUORUM/LOCAL_QUORUM - if the user wants to be sure they see the latest data and the query needs to execute `read_barrier()`, which requires a quorum of replicas - ONE/LOCAL_ONE - if the user just wants to read data from one replica without synchronization All tests were updated to use LOCAL_QUORUM for both reads and writes. Fixes SCYLLADB-1766 SC is in experimental phase and this patch is an improvement, no backport needed. Closes scylladb/scylladb#29691 * github.com:scylladb/scylladb: strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes	2026-05-13 16:31:05 +02:00
Michael Litvak	5a5c7c6241	strong_consistency: wait for raft servers to start in create table When creating a strongly consistent table, wait for the table's raft servers to start and be ready to serve queries before completing the operation. We want the create table operation to absorb the delay of starting the raft groups instead of the first queries. The create table coordinator commits and applies the schema statement, then it waits for all hosts that have a tablet replica to create and start the raft groups for the table's tablets. It does this by sending an RPC to all the relevant hosts that executes a group0 barrier, in order to ensure the table and raft groups are created, then waits for all raft groups on the host to finish starting and be ready. Fixes SCYLLADB-807	2026-05-13 08:43:24 +02:00
Michał Jadwiszczak	d073097ebf	strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads We can execute strong consistent read queries in 2 ways: - with QUORUM/LOCAL_QUORUM CL - this path executes `read_barrier()` before reading the data, which synchronizes Raft log with the leader. But to execute it, we need quorum of replicas - with ONE/LOCAL_ONE CL - this path just reads data from one replica without any synchronization (not implemented yet)	2026-05-12 23:20:07 +02:00
Michał Jadwiszczak	68f0cf6fac	strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes To successfully write data to a strongly consistent table, a quorum of replicas needs to be used to save the data to the Raft log. So the only reasonable consistency level is QUORUM/LOCAL_QUORUM (currently SC doesn't support multi DC).	2026-05-12 23:20:03 +02:00
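The consistency-level rules from the two commits above can be summarized in a small check. This is a sketch under assumptions: the enum and the function `is_cl_allowed_for_sc` are hypothetical names, and the real validation code may be structured differently.

```cpp
// Hypothetical subset of CQL consistency levels, for illustration only.
enum class consistency_level {
    ANY, ONE, TWO, THREE, QUORUM, ALL,
    LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE
};

enum class request_kind { read, write };

// Returns true if the consistency level is accepted for a strongly
// consistent request, per the rules in the commit messages:
//   writes: QUORUM / LOCAL_QUORUM only
//   reads:  QUORUM / LOCAL_QUORUM, or ONE / LOCAL_ONE
// (the commit notes that the ONE / LOCAL_ONE read path is not implemented yet).
bool is_cl_allowed_for_sc(request_kind kind, consistency_level cl) {
    switch (cl) {
    case consistency_level::QUORUM:
    case consistency_level::LOCAL_QUORUM:
        return true;
    case consistency_level::ONE:
    case consistency_level::LOCAL_ONE:
        return kind == request_kind::read;
    default:
        return false;
    }
}
```

Rejecting every other level explicitly, instead of silently ignoring it, is the behavior change the commits describe.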
Piotr Dulikowski	129f193116	Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes. The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard. Fixes SCYLLADB-1343 Strong consistency is still in experimental phase, no need to backport. Closes scylladb/scylladb#29318 * github.com:scylladb/scylladb: test/strong_consistency: verify metrics strong_consistency: wire up metrics to operations strong_consistency: add stats struct and metrics registration	2026-05-12 16:15:51 +02:00
Marcin Maliszkiewicz	3df951bc9c	Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski Commit `16b56c2451` ("Audit: avoid dynamic_cast on a hot path") moved audit info into batch_statement via set_audit_info(), but only wired it for the CQL-text BATCH path (raw::batch_statement::prepare()). Native-protocol BATCH messages (opcode 0x0D), handled by process_batch_internal in transport/server.cc, construct a batch_statement without setting audit_info. This causes audit to silently skip the entire batch. Set audit_info on the batch_statement so these batches are audited. Fixes SCYLLADB-1652 No backport - bug introduced recently. Closes scylladb/scylladb#29570 * github.com:scylladb/scylladb: test/audit: add reproducer for native-protocol batch not being audited audit: set audit_info for native-protocol BATCH messages test/audit: rename internal test methods to avoid CI misdetection	2026-04-22 18:56:28 +02:00
Michał Jadwiszczak	f77c258c8e	strong_consistency: wire up metrics to operations Track write and read latency using latency_counter in coordinator::mutate() and coordinator::query(). Count commit_status_unknown errors in coordinator::mutate(). Count node and shard bounces in redirect_statement(), passing the coordinator's stats from both modification_statement and select_statement.	2026-04-22 08:59:59 +02:00
Tomasz Grabiec	cddde464ca	Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack list replication factor. In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF. In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently of one another. In each request, racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in the schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. The load balancer interprets this and starts the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567. 
No backport needed; new feature. Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature	2026-04-22 01:46:11 +02:00
Andrzej Jackowski	f5bb9b6282	audit: set audit_info for native-protocol BATCH messages Commit `16b56c2451` ("Audit: avoid dynamic_cast on a hot path") moved audit info into batch_statement via set_audit_info(), but only wired it for the CQL-text BATCH path (raw::batch_statement::prepare()). Native-protocol BATCH messages (opcode 0x0D), handled by process_batch_internal in transport/server.cc, construct a batch_statement without setting audit_info. This causes audit to silently skip the entire batch. Set audit_info on the batch_statement so these batches are audited. Fixes SCYLLADB-1652	2026-04-21 21:52:26 +02:00
Nadav Har'El	6165124fcc	Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity The statement_restrictions code is responsible for analyzing the WHERE clause, deciding on the query plan (which index to use), and extracting the partition and clustering keys to use for the index. Currently, it suffers from repetition in making its decisions: there are 15 calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis of binary operators is done once, then reused. The key data structure introduced is the predicate. While an expression takes inputs from the row evaluated, constants, and bind variables, and produces a boolean result, predicates ask which values for a column (or a number of columns) are needed to satisfy (part of) the WHERE clause. The WHERE clause is then expressed as a conjunction of such predicates. The analyzer uses the predicates to select the index, then uses the predicates to compute the partition and clustering keys. The refactoring is composed of these parts (but patches from different parts are interspersed): 1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change 2. move computation from query time to prepare time 3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API Major refactoring, and no bugs fixed, so definitely not backporting. 
Closes scylladb/scylladb#29114 * github.com:scylladb/scylladb: cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn cql3: statement_restrictions: remove extract_single_column_restrictions_for_column cql3: statement_restrictions: use predicate vectors in prepare_indexed_local cql3: statement_restrictions: use predicate vector size for clustering prefix length cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions cql3: statement_restrictions: add predicate-based index support checking cql3: statement_restrictions: use pre-built single-column maps for index support checks cql3: statement_restrictions: build clustering-prefix restrictions incrementally cql3: statement_restrictions: build partition-range restrictions incrementally cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally cql3: statement_restrictions: build partition-key single-column restrictions map incrementally cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column cql3: statement_restrictions: track has-token state incrementally cql3: statement_restrictions: track partition-key-empty state incrementally cql3: statement_restrictions: track first multi-column predicate incrementally cql3: statement_restrictions: track last clustering column incrementally cql3: statement_restrictions: track clustering-has-slice incrementally cql3: statement_restrictions: track 
has-multi-column-clustering incrementally cql3: statement_restrictions: track clustering-empty state incrementally cql3: statement_restrictions: replace restr bridge variable with pred.filter cql3: statement_restrictions: convert single-column branch to use predicate properties cql3: statement_restrictions: convert multi-column branch to use predicate properties cql3: statement_restrictions: convert constructor loop to iterate over predicates cql3: statement_restrictions: annotate predicates with operator properties cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column cql3: statement_restrictions: complete preparation early cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column cql3: statement_restrictions: refine possible_lhs_values() function_call processing cql3: statement_restrictions: return nullptr for function solver if not token cql3: statement_restrictions: refine possible_lhs_values() subscript solving cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error cql3: statement_restrictions: convert possible_lhs_values into a solver cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion cql3: statement_restrictions: refactor IS NOT NULL processing cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions cql3: statement_restrictions: fold add_is_not_restriction() into its 
caller cql3: statement_restrictions: fold add_restriction() into its caller cql3: statement_restrictions: remove possible_partition_token_values() cql3: statement_restrictions: remove possible_column_values cql3: statement_restrictions: pass schema to possible_column_values() cql3: statement_restrictions: remove fallback path in solve() cql3: statement_restrictions: reorder possible_lhs_column parameters cql3: statement_restrictions: prepare solver for multi-column restrictions cql3: statement_restrictions: add solver for token restriction on index cql3: statement_restrictions: pre-analyze column in value_for() cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase cql3: statement_restrictions: adjust signature of range_from_raw_bounds cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder cql3: statement_restrictions: multi-key clustering restrictions one layer deeper cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds() cql3: statement_restrictions: pre-analyze single-column clustering key restrictions cql3: statement_restrictions: wrap value_for_index_partition_key() cql3: statement_restrictions: hide value_for() cql3: statement_restrictions: push down clustering prefix wrapper one level cql3: statement_restrictions: wrap functions that return clustering ranges cql3: statement_restrictions: do not pass view schema back and forth cql3: statement_restrictions: pre-analyze token range restrictions cql3: statement_restrictions: pre-analyze partition key columns cql3: statement_restrictions: do not collect subscripted partition key columns cql3: statement_restrictions: split _partition_range_restrictions into three cases cql3: 
statement_restrictions: move value_list, value_set to header file cql3: statement_restrictions: wrap get_partition_key_ranges cql3: statement_restrictions: prepare statement_restrictions for capturing `this` test: statement_restrictions: add index_selection regression test	2026-04-21 15:44:06 +03:00
Łukasz Paszkowski	d18eb9479f	cql/statement: Create keyspace_metadata with correct initial_tablets count In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count is set to 0 only when tablets are enabled and the replication strategy is NetworkTopologyStrategy. This effectively sets _uses_tablets = false in abstract_replication_strategy for the remaining strategies when no `tablets = {...}` options are specified. As a consequence, it is possible to create vnode-based keyspaces even when tablets are enforced with `tablets_mode_for_new_keyspaces`. The patch sets the default initial tablets count to zero regardless of the chosen replication strategy. Each replication strategy then validates the options and raises a configuration exception when tablets are not supported. All tests are altered in the following way: + wherever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy + otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}` Fixes https://github.com/scylladb/scylladb/issues/25340 Closes scylladb/scylladb#25342	2026-04-20 17:57:38 +03:00
Avi Kivity	325497d460	cql3: statement_restrictions: hide value_for() value_for() is a general function that solves for values that satisfy an expression set to TRUE. This goes against our goal to prepare solvers for all the expressions we use. Fortunately, it's only called with one expression, which comes from statement_restrictions, so we can add an accessor that provides the expression from our own state. Later, we'll be able to do prepare-time work on it.	2026-04-19 20:57:04 +03:00
Avi Kivity	620df7103f	cql3: statement_restrictions: do not pass view schema back and forth For indexed queries, statement_restrictions calculates _view_schema, which is passed via get_view_schema() to indexed_select_statement(), which passes it right back to statement_restrictions via one of three functions to calculate clustering ranges. Avoid the back-and-forth and use the stored value. Using a different value would be broken. This change allows unifying the signatures of the four functions that get clustering ranges.	2026-04-19 20:57:03 +03:00
Avi Kivity	eec0b20dbc	cql3: statement_restrictions: prepare statement_restrictions for capturing `this` Prevent copying/moving, that can change the address, and instead enforce using shared_ptr. Most of the code is already using shared_ptr, so the changes aren't very large. To forbid non-shared_ptr construction, the constructors are annotated with a private_tag tag class.	2026-04-19 20:57:03 +03:00
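The tag-class technique mentioned in the commit above can be sketched roughly as follows. `restrictions_sketch` and its members are hypothetical; only the idea (a private tag type in the constructor signature, no copy or move, construction through a factory returning shared_ptr) comes from the commit message.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <type_traits>

class restrictions_sketch {
    // Private tag: code outside the class cannot name this type, so it is
    // steered towards the make() factory below.
    struct private_tag {};

    int _column_count;
    // Captures `this`; safe only because the object's address never changes.
    std::function<int()> _count_fn;

public:
    restrictions_sketch(private_tag, int column_count)
        : _column_count(column_count)
        , _count_fn([this] { return _column_count; }) {}

    // Deleting copy operations also suppresses the implicit move operations,
    // so the address captured above stays valid for the object's lifetime.
    restrictions_sketch(const restrictions_sketch&) = delete;
    restrictions_sketch& operator=(const restrictions_sketch&) = delete;

    // Factory: the intended way to create an instance.
    static std::shared_ptr<restrictions_sketch> make(int column_count) {
        assert(column_count >= 0);
        return std::make_shared<restrictions_sketch>(private_tag{}, column_count);
    }

    int column_count() const { return _count_fn(); }
};

static_assert(!std::is_copy_constructible<restrictions_sketch>::value,
              "copying would invalidate the captured this pointer");
static_assert(!std::is_move_constructible<restrictions_sketch>::value,
              "moving would invalidate the captured this pointer");
```

The constructor stays public so that `std::make_shared` can call it, while the tag parameter keeps ordinary callers on the factory path.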
Pawel Pery	7883f161bb	vector-store: fix creating local vector search indexes with a part of the partition key Users should be able to create a local index for Vector Search based on only a part of the partition key. This commit makes that possible by removing the 'full partition key only' requirement for custom local indexes. The commit updates the docs to explain that a local vector index can use only a part of the partition key. The commit adds a cqlpy test to check the fixed functionality. Fixes: SCYLLADB-953 Needs to be backported to 2026.1 as it is a fix for local vector indexes. Closes scylladb/scylladb#28931	2026-04-17 11:44:15 +02:00
Aleksandra Martyniuk	38bad5f316	cql3: allow changing RF by more than one when adding or removing a DC rf_rack_valid_keyspaces relies on the fact that replicas of base table and mv are streamed concurrently. This is no longer true for newly introduced method of adding a DC. Disable rf_rack_valid_keyspaces in test_mv_first_replica_in_dc to force the old method.	2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk	72bb3113ac	db: add columns to system_schema.keyspaces Add a new next_replication column to system_schema.keyspaces table. While there is an ongoing RF change: - next_replication keeps the target RF values; - existing replication_v2 column keeps initial RF values - the ones we started the RF change with. DESCRIBE KEYSPACE statement shows replication_v2. When there is no ongoing RF change for this keyspace, its next_replication is empty. In this commit no data is kept in the new column.	2026-04-17 09:58:07 +02:00
Pavel Emelyanov	335261f351	cql3: Move enable_create_table_with_compact_storage to cql_config Move enable_create_table_with_compact_storage option from db::config to cql_config. This improves separation of concerns by consolidating CQL-specific table creation policies in the cql_config structure. Update the CREATE TABLE statement prepare() function to use the new location for the configuration check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 08:52:20 +03:00
Pavel Emelyanov	f20ede79f9	cql3: Move strict_is_not_null_in_views to cql_config Move strict_is_not_null_in_views option from db::config to cql_config via new view_restrictions sub-struct. This improves separation of concerns by keeping view-specific validation policies with other CQL configuration. Update prepare_view() to take view_restrictions reference instead of reaching into db::config, and update all callsites to pass the sub-struct. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 08:52:19 +03:00
Pavel Emelyanov	027c91f45e	cql3: Move restrict_future_timestamp to cql_config Move restrict_future_timestamp option from db::config to cql_config. This improves separation of concerns as timestamp validation is part of CQL query execution behavior. Update validate_timestamp() function signature to take cql_config reference instead of db::config, and update all callsites in modification_statement and batch_statement to pass cql_config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 08:51:53 +03:00
Pavel Emelyanov	7264581881	cql3: Move TWCS restriction options to cql_config Move twcs_max_window_count and restrict_twcs_without_default_ttl options from db::config to cql_config via new twcs_restrictions sub-struct. This improves separation of concerns by keeping TWCS-specific validation policies with other CQL configuration. Update check_restricted_table_properties() to remove unused db parameter and take twcs_restrictions reference instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 08:51:52 +03:00
Pavel Emelyanov	8b853505cd	cql3: Move keyspace restriction options to cql_config Introduce replication_restrictions, a sub-struct of cql_config, to hold the seven keyspace-level policy options that govern how CREATE/ALTER KEYSPACE statements are validated: - restrict_replication_simplestrategy - replication_strategy_warn_list / replication_strategy_fail_list - minimum/maximum_replication_factor_warn/fail_threshold Pass replication_restrictions into check_against_restricted_replication_strategies() instead of having it reach into db::config directly (via both qp.db().get_config() and qp.proxy().data_dictionary().get_config()). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 08:51:24 +03:00
Pavel Emelyanov	1af26a1dd6	cql3: Move batch_size_fail_threshold_in_kb to cql_config The batch_size_fail_threshold_in_kb option controls the batch size at which an oversized batch error is returned to the client. It belongs in cql_config rather than db::config as it directly governs CQL batch statement behavior. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:27 +03:00
Pavel Emelyanov	4d255cf533	cql3: Move batch_size_warn_threshold_in_kb to cql_config The batch_size_warn_threshold_in_kb option controls the batch size at which a client warning is emitted during batch execution. It belongs in cql_config rather than db::config as it directly governs CQL batch statement behavior. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:27 +03:00
Pavel Emelyanov	a3f097f100	cql3: Move enable_parallelized_aggregation to cql_config The enable_parallelized_aggregation option controls whether aggregation queries are fanned out across shards for parallel execution. It belongs in cql_config rather than db::config as it directly governs CQL query behavior at prepare time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:27 +03:00
Pavel Emelyanov	4314fc0642	cql3: Move strict_allow_filtering to cql_config The strict_allow_filtering option controls whether queries that require ALLOW FILTERING are silently accepted, warned about, or rejected. It belongs in cql_config rather than db::config as it directly governs CQL query behavior at prepare time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:26 +03:00
Pavel Emelyanov	3411ed8bcc	cql3: Move select_internal_page_size to cql_config The select_internal_page_size option controls CQL query execution behavior (internal paging for aggregate/filtered SELECTs) and belongs in cql_config rather than being read directly from db::config at execution time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:26 +03:00
Pavel Emelyanov	60a834d9fa	cql3: Add cql_config parameter to parsed_statement::prepare() Pass cql_config to prepare() so that statement preparation can use CQL-specific configuration rather than reaching into db::config directly. Callers that use default_cql_config: - db/view/view.cc: builds a SELECT statement internally to compute view restrictions, not in response to a user query - cql3/statements/create_view_statement.cc: same -- parses the view's WHERE clause as a synthetic SELECT to extract restrictions - tools/schema_loader.cc: offline schema loading tool, no runtime config available - tools/scylla-sstable.cc: offline sstable inspection tool, no runtime config available Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 07:57:25 +03:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
Marcin Maliszkiewicz	53b6e9fda5	Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included. Cleaning components inter-dependencies, not backporting Closes scylladb/scylladb#29429 * github.com:scylladb/scylladb: test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation describe_statement: Get cluster info from storage_service storage_service: Add describe_cluster() method query_processor: Expose storage_service accessor	2026-04-15 14:40:15 +03:00
Nadav Har'El	1eb8d170dd	Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability. The intended flow is: 1. Create a new vector index on a column that already has one. 2. Keep serving ANN queries from the old index while the new one is being built. 3. Verify the new index is ready. 4. Automatically switch to the remaining index. 5. Drop the old index. To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready. This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before. Test coverage is updated accordingly: - Scylla now verifies that two vector indexes can coexist on the same column. - Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column. Fixes: VECTOR-610 Closes scylladb/scylladb#29407 * github.com:scylladb/scylladb: docs: document vector index metadata and duplicate handling test/cqlpy: cover vector index duplicate creation rules vector_index: allow multiple named indexes on one column vector_index: store `index_version` as creation timeuuid	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	a428472e50	db: Remove redundant enable_logstor config option The enable_logstor configuration option is redundant with the 'logstor' experimental feature flag. Consolidate to a single gate: use the experimental feature to control both whether logstor is available for table creation and whether it is initialized at database startup. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29427	2026-04-15 14:40:15 +03:00
Botond Dénes	87eb20ba33	Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec This metric is used to catch execution of scans which go via row cache, which can have bad effect on performance. Since `f344bd0aaa`, aggregate queries go via new statement class: parallelized_select_statement. This class inherits from select_statement directly rather than from primary_key_select_statement. The range scan detection logic (_range_scan, _range_scan_no_bypass_cache) was only in primary_key_select_statement's constructor, so parallelized queries were not counted in select_partition_range_scan and select_partition_range_scan_no_bypass_cache metrics. Fix by moving the range scan detection into select_statement's constructor, so that all subclasses get it. No backport: enhancement Closes scylladb/scylladb#29422 * github.com:scylladb/scylladb: cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric test: cluster: dtest: Fix double-counting of metrics	2026-04-15 14:40:15 +03:00
Tomasz Grabiec	50fbac6ea6	tablets: Introduce pow2_count per-table tablet option By default it's true, in which case tablet count of the table is rounded up to a power of two. This option allows lifting this, in which case the count can be arbitrary. This will allow testing the logic of arbitrary tablet count.	2026-04-15 10:40:56 +02:00
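The rounding rule described in the commit above can be sketched as follows. This is a minimal illustration, not ScyllaDB's code: the function names `round_up_pow2` and `effective_tablet_count` are hypothetical, and treating a requested count of zero as one tablet is an assumption.

```cpp
#include <cstddef>

// Hypothetical helper: smallest power of two that is >= n.
// Returns 1 for n == 0. Overflow for very large n is not handled in this sketch.
inline std::size_t round_up_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Hypothetical helper modelling the per-table "pow2_count" option:
// true (the default) rounds the count up to a power of two,
// false keeps the requested count as-is.
inline std::size_t effective_tablet_count(std::size_t requested, bool pow2_count = true) {
    if (requested == 0) {
        return 1; // assumption: a table has at least one tablet
    }
    return pow2_count ? round_up_pow2(requested) : requested;
}
```

With the option left at its default, a requested count of 5 becomes 8 in this sketch; with `pow2_count` set to false it stays 5.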
Pavel Emelyanov	debfb147f5	describe_statement: Get cluster info from storage_service Update cluster_describe_statement::describe() to retrieve cluster metadata from storage_service::describe_cluster() instead of directly from db::config or gossiper. The storage_service provides a centralized API for accessing cluster metadata (cluster_name, partitioner, snitch_name) that works in both normal and maintenance modes, improving separation of concerns. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-14 19:33:06 +03:00
Alex	0f6d9ffd22	cql: expose stable result metadata for prepared LIST statements Prepared LIST statements were not calculating metadata in the PREPARE path and sent an empty string hash to the client, causing problematic behaviour where metadata_id was not recalculated correctly. This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuses that metadata when building the result set. This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants. This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id. This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.	2026-04-13 17:49:27 +03:00
Dawid Pawlik	63b782451e	vector_index: allow multiple named indexes on one column Allow creating multiple named vector indexes on the same column while still rejecting duplicate unnamed ones. `index_metadata::equals_noname()` now ignores `index_version`, which is unique for every vector index creation, so duplicate detection keeps working for unnamed vector indexes. CREATE INDEX keeps using structural duplicate detection for regular indexes and unnamed vector indexes, but named vector indexes are checked by name only. The explicit name check is also needed for IF NOT EXISTS when the same index name already exists on a different table in the same keyspace, because vector indexes have no backing view table to catch that case.	2026-04-13 15:04:59 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Dawid Pawlik	2dd8eef38c	vector_index: store `index_version` as creation timeuuid Vector indexes currently store the base table schema version in `index_version`. That value is name-based, not time-based, so it does not represent when the index was created. Store a timeuuid instead and change the relevant interfaces from `table_schema_version` to `utils::UUID`. This is a prerequisite for supporting multiple vector indexes on the same column where the oldest index must be selected deterministically via routing implemented in Vector Store. Update the cqlpy tests to check the new semantics directly: recreating the index changes `index_version`, while ALTER TABLE does not.	2026-04-10 13:05:21 +02:00
Piotr Dulikowski	32e3a01718	Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek Motivation ---------- Since strongly consistent tables are based on the concept of Raft groups, operations on them can get stuck for indefinite amounts of time. That may be problematic, and so we'd like to implement a way to cancel those operations at suitable times. Description of solution ----------------------- The situations we focus on are the following: * Timed-out queries * Leader changes * Tablet migrations * Table drops * Node shutdowns We handle each of them and provide validation tests. Implementation strategy ----------------------- 1. Auxiliary commits. 2. Abort operations on timeout. 3. Abort operations on tablet removal. 4. Extend `client_state`. 5. Abort operation on shutdown. 6. Help `state_machine` be aborted as soon as possible. Tests ----- We provide tests that validate the correctness of the solution. The total time spent on `test_strong_consistency.py` (measured on my local machine, dev mode): Before: ``` real 0m31.809s user 1m3.048s sys 0m21.812s ``` After: ``` real 0m34.523s user 1m10.307s sys 0m27.223s ``` The incremental differences in time can be found in the commit messages. Fixes SCYLLADB-429 Backport: not needed. This is an enhancement to an experimental feature. 
Closes scylladb/scylladb#28526 * github.com:scylladb/scylladb: service: strong_consistency: Abort state_machine::apply when aborting server service: strong_consistency: Abort ongoing operations when shutting down service: client_state: Extend with abort_source service: strong_consistency: Handle abort when removing Raft group service: strong_consistency: Abort Raft operations on timeout service: strong_consistency: Use timeout when mutating service: strong_consistency: Fix indentation service: strong_consistency: Enclose coordinator methods with try-catch service: strong_consistency: Crash at unexpected exception test: cluster: Extract default config & cmdline in test_strong_consistency.py	2026-04-10 11:11:21 +02:00
Tomasz Grabiec	88bea5aaf3	cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric This metric is used to catch execution of scans which go via row cache, which can have bad effect on performance. Since `f344bd0aaa`, aggregate queries go via new statement class: parallelized_select_statement. This class inherits from select_statement directly rather than from primary_key_select_statement. The range scan detection logic (_range_scan, _range_scan_no_bypass_cache) was only in primary_key_select_statement's constructor, so parallelized queries were not counted in select_partition_range_scan and select_partition_range_scan_no_bypass_cache metrics. Fix by moving the range scan detection into select_statement's constructor, so that all subclasses get it.	2026-04-10 02:12:48 +02:00
Szymon Wasik	573def7cd8	cql: accept source_model option and show options in DESCRIBE Accept the Cassandra SAI 'source_model' option for vector indexes. This option is used by Cassandra libraries (e.g., CassIO, LangChain) to tag vector indexes with the name of the embedding model that produced the vectors. ScyllaDB does not use the source_model value but stores it and includes it in the DESCRIBE INDEX output for Cassandra compatibility. Additionally, extend vector_index::describe() to emit a WITH OPTIONS = {...} clause containing all user-provided index options (filtering out system keys: target, class_name, index_version). This makes options like similarity_function, source_model, etc. visible in DESCRIBE output.	2026-04-09 17:20:03 +02:00
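The option filtering described in the commit above might look roughly like this. It is a sketch under assumptions: the helper name `describe_user_options`, the use of `std::map`, and the exact quoting of the emitted clause are illustrative only; the three system keys are the ones the commit message names.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical helper; name, signature, and output format are assumptions.
inline std::string describe_user_options(const std::map<std::string, std::string>& options) {
    // System keys named in the commit message; they are hidden from DESCRIBE output.
    static const std::set<std::string> system_keys = {"target", "class_name", "index_version"};

    std::string body;
    for (const auto& kv : options) {
        if (system_keys.count(kv.first)) {
            continue; // skip internal bookkeeping keys
        }
        if (!body.empty()) {
            body += ", ";
        }
        body += "'" + kv.first + "': '" + kv.second + "'";
    }
    // Assumption: when no user-provided options remain, no clause is emitted.
    return body.empty() ? std::string() : "WITH OPTIONS = {" + body + "}";
}
```

Because `std::map` iterates in key order, the emitted options are sorted by name here; whether the real DESCRIBE output is ordered this way is not stated in the commit.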
Szymon Wasik	80a2e4a0ab	cql: add Cassandra SAI (StorageAttachedIndex) compatibility Libraries such as CassIO, LangChain, and LlamaIndex create vector indexes using Cassandra's StorageAttachedIndex (SAI) class name. This commit lets ScyllaDB accept these statements without modification. When a CREATE CUSTOM INDEX statement specifies an SAI class name on a vector column, ScyllaDB automatically rewrites it to the native vector_index implementation. Accepted class names (case-insensitive): - org.apache.cassandra.index.sai.StorageAttachedIndex - StorageAttachedIndex - sai SAI on non-vector columns is rejected with a clear error directing users to a secondary index instead. The SAI detection and rewriting logic is extracted into a dedicated static function (maybe_rewrite_sai_to_vector_index) to keep the already-long validate_while_executing method manageable. Multi-column (local index) targets and nonexistent columns are skipped with continue — the former are treated as filtering columns by vector_index::check_target(), and the latter are caught later by vector_index::validate(). Tests that exercise features common to both backends (basic creation, similarity_function, IF NOT EXISTS, bad options, etc.) now use the SAI class name with the skip_on_scylla_vnodes fixture so they run against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue to use USING 'vector_index' with scylla_only.	2026-04-09 17:20:03 +02:00
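The class-name handling described in the commit above can be sketched as below. This is not the body of `maybe_rewrite_sai_to_vector_index`: the helper names `is_sai_class_name` and `rewrite_index_class`, the `target_is_vector_column` flag, and the error text are hypothetical. The three accepted names and the `vector_index` class name come from the commit message.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <stdexcept>
#include <string>

// Lower-case an ASCII string so class names can be compared case-insensitively.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: true if the class name is one of the accepted SAI spellings.
inline bool is_sai_class_name(const std::string& class_name) {
    static const std::array<const char*, 3> accepted = {{
        "org.apache.cassandra.index.sai.storageattachedindex",
        "storageattachedindex",
        "sai",
    }};
    const std::string lowered = to_lower_ascii(class_name);
    return std::any_of(accepted.begin(), accepted.end(),
                       [&](const char* name) { return lowered == name; });
}

// Hypothetical helper: map an SAI class name on a vector column to the native
// class, reject SAI on other columns, and pass any other class name through.
inline std::string rewrite_index_class(const std::string& class_name, bool target_is_vector_column) {
    if (!is_sai_class_name(class_name)) {
        return class_name;
    }
    if (!target_is_vector_column) {
        // Illustrative message only; the real error text is not quoted in the commit.
        throw std::invalid_argument("SAI is only supported on vector columns; use a secondary index instead");
    }
    return "vector_index";
}
```

The sketch leaves out the multi-column and nonexistent-column cases, which the commit says are skipped and handled later by `vector_index::check_target()` and `vector_index::validate()`.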
Dawid Mędrek	ad8a263683	service: strong_consistency: Abort ongoing operations when shutting down These changes are complementary to those from a recent commit where we handled aborting ongoing operations during tablet events, such as tablet migration. In this commit, we consider the case of shutting down a node. When a node is shutting down, we eventually close the connections. When the client can no longer get a response from the server, it makes no sense to continue with the queries. We'd like to cancel them at that point. We leverage the abort source passed down via `client_state` down to the strongly consistent coordinator. This way, the transport layer can communicate with it and signal that the queries should be canceled. The abort source is triggered by the CQL server (cf. `generic_server::server::{stop,shutdown}`). --- Note that this is not an optional change. In fact, if we don't abort those requests, we might hang for an indefinite amount of time when executing the following code in `main.cc`: ``` // Register at_exit last, so that storage_service::drain_on_shutdown will be called first auto do_drain = defer_verbose_shutdown("local storage", [&ss] { ss.local().drain_on_shutdown().get(); }); ``` The problem boils down to the fact that `generic_server::server::stop` will wait for all connections to be closed, but that won't happen until all ongoing operations (at least those to strongly consistent tables) are finished. It's important to highlight that even though we hang on this, the client can no longer get any response. Thus, it's crucial that at that point we simply abort ongoing operations to proceed with the rest of shutdown. --- Two tests are added to verify that the implementation is correct: one focusing on local operations, the other -- on a forwarded write. 
Difference in time spent on the whole test file `test_strong_consistency.py` on my local machine, in dev mode: Before: ``` real 0m31.775s user 1m4.475s sys 0m22.615s ``` After: ``` real 0m32.024s user 1m10.751s sys 0m23.871s ``` Individual runs of the added tests: test_queries_when_shutting_down: ``` real 0m12.818s user 0m36.726s sys 0m4.577s ``` test_abort_forwarded_write_upon_shutdown: ``` real 0m12.930s user 0m36.622s sys 0m4.752s ```	2026-04-09 11:36:17 +02:00
Dawid Mędrek	2243e0ffea	service: strong_consistency: Use timeout when mutating We remove the inconsistency between reads and writes to strongly consistent tables. Before the commit, only reads used a timeout. Now, writes do as well. Although the parameter isn't used yet, that will change in the following commit. This is a prerequisite for it.	2026-04-09 11:25:57 +02:00
Karol Nowacki	6bc88e817f	vector_search: fix SELECT on local vector index Queries against local vector indexes were failing with the error: "ANN ordering by vector requires the column to be indexed using 'vector_index'" This was a regression introduced by `15788c3734`, which incorrectly assumed the first column in the targets list is always the vector column. For local vector indexes, the first column is the partition key, causing the failure. Previously, serialization logic for the target index option was shared between vector and secondary indexes. This is no longer viable due to the introduction of local vector indexes and vector indexes with filtering columns, which have different target format. This commit introduces a dedicated JSON-based serialization format for vector index targets, identifying the target column (tc), filtering columns (fc), and partition key columns (pk). This ensures unambiguous serialization and deserialization for all vector index types. This change is backward compatible for regular vector indexes. However, it breaks compatibility for local vector indexes and vector indexes with filtering columns created in version 2026.1.0. To mitigate this, usage of these specific index types will be blocked in the 2026.1.0 release by failing ANN queries against them in vector-store service. Fixes: SCYLLADB-895	2026-03-30 16:46:48 +02:00
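To make the idea of a dedicated target format concrete, here is a sketch of serializing a vector index target with the three fields the commit above names (tc, fc, pk). The commit does not show the actual layout, so the JSON shape, the struct `vector_index_target`, and the function `serialize_target` are all assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of a vector index target.
struct vector_index_target {
    std::string target_column;                       // "tc": the vector column
    std::vector<std::string> filtering_columns;      // "fc": filtering columns
    std::vector<std::string> partition_key_columns;  // "pk": set for local indexes
};

// Quote a string as a JSON string literal (escapes only '"' and '\\' in this sketch).
inline std::string json_quote(const std::string& s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Render a list of names as a JSON array of strings.
inline std::string json_array(const std::vector<std::string>& items) {
    std::string out = "[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i != 0) {
            out += ',';
        }
        out += json_quote(items[i]);
    }
    out += ']';
    return out;
}

// Assumed layout: one JSON object with the keys "tc", "fc" and "pk".
inline std::string serialize_target(const vector_index_target& t) {
    return "{\"tc\":" + json_quote(t.target_column) +
           ",\"fc\":" + json_array(t.filtering_columns) +
           ",\"pk\":" + json_array(t.partition_key_columns) + "}";
}
```

Naming each role explicitly is what removes the ambiguity the commit describes: a reader of the serialized target no longer has to assume that the first listed column is the vector column.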


2012 Commits