Commit Graph

2012 Commits

Author SHA1 Message Date
Petr Gusev
9e3209e4a3 cql: refactor add_tablet_info to take tablet_routing_info directly
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.

Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.
2026-05-15 12:28:33 +02:00
Petr Gusev
738b7b4a86 cql: fix UB dereference of nullopt tablet_info in execute_with_condition
When check_locality() returns nullopt (correctly routed LWT), the
optional tablet_info was unconditionally dereferenced in the lambda
capture list: tablet_info->tablet_replicas, tablet_info->token_range.

The code previously masked this by initializing tablet_info with an
empty-but-present value, so the dereference happened to work but
only because the empty tablet_replicas made add_tablet_info() a no-op.
After check_locality() overwrites it with nullopt, the dereference
is UB.

Fix by initializing tablet_info as empty (nullopt) and guarding the
dereference.
2026-05-15 11:56:14 +02:00
Petr Gusev
167a3c9c50 cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041
2026-05-15 11:56:14 +02:00
Piotr Dulikowski
f3ac35f9d2 Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807

no backport - strong consistency is still experimental

Closes scylladb/scylladb#28843

* github.com:scylladb/scylladb:
  strong_consistency: wait for leader when starting a group
  strong_consistency: change wait for groups to start on startup
  strong_consistency: optimize wait_for_groups_to_start
  strong_consistency: wait for raft servers to start in create table
2026-05-13 16:42:05 +02:00
Piotr Dulikowski
dc05bd35bb Merge 'strong_consistency: limit available consistency levels in strong consistent requests' from Michał Jadwiszczak
Strong consistent requests take different patch then EC requests and consistency levels don’t map well.
We should limit available consistency levels in SC request to avoid ignoring them silently, which may cause confusion to user.

For writes, there is only one option:
- QUORUM/LOCAL_QUORUM (multi DC is not supported yet, so both of those CLs have the same effect) - we need quorum of replicas to successfully commit new mutations to Raft log.

For reads, there are 2 options:
- QUORUM/LOCAL_QUORUM - if user wants to be sure he sees latest data and the query needs to execute `read_barrier()`, which requires quorum of replicas
- ONE/LOCAL_ONE - if user just wants to read data from one replica without synchronization

All tests were updated to use LOCAL_QUORUM for both read and writes.

Fixes SCYLLADB-1766

SC is in experimental phase and this patch is an improvement, no backport needed.

Closes scylladb/scylladb#29691

* github.com:scylladb/scylladb:
  strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
  strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
2026-05-13 16:31:05 +02:00
Michael Litvak
5a5c7c6241 strong_consistency: wait for raft servers to start in create table
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807
2026-05-13 08:43:24 +02:00
Michał Jadwiszczak
d073097ebf strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
We can execute strong consistent read queries in 2 ways:
- with QUORUM/LOCAL_QUORUM CL - this path executes `read_barrier()`
  before reading the data, which synchronizes Raft log with the leader.
  But to execute it, we need quorum of replicas
- with ONE/LOCAL_ONE CL - this path just reads data from one replica
  without any synchronization (not implemented yet)
2026-05-12 23:20:07 +02:00
Michał Jadwiszczak
68f0cf6fac strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
To successfully write data to strong consistent table, a quorum of
replicas need to be used to save the data to Raft log.

So the only reasonable consistency level is QUORUM/LOCAL_QUORUM
(currently SC doesn't support multi DC).
2026-05-12 23:20:03 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Marcin Maliszkiewicz
3df951bc9c Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652

No backport - bug introduced recently.

Closes scylladb/scylladb#29570

* github.com:scylladb/scylladb:
  test/audit: add reproducer for native-protocol batch not being audited
  audit: set audit_info for native-protocol BATCH messages
  test/audit: rename internal test methods to avoid CI misdetection
2026-04-22 18:56:28 +02:00
Michał Jadwiszczak
f77c258c8e strong_consistency: wire up metrics to operations
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().

Count commit_status_unknown errors in coordinator::mutate().

Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
2026-04-22 08:59:59 +02:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove a DC(s) in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor.

In existing approach, during RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to a ongoing_rf_changes - a new column in system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF.

In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the Rf change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.

If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by load balancer, that will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
Andrzej Jackowski
f5bb9b6282 audit: set audit_info for native-protocol BATCH messages
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Nadav Har'El
6165124fcc Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.

Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
reduces to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.

The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.

The refactoring is composed of these parts (but patches from different parts
are interspersed):

1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API

Major refactoring, and no bugs fixed, so definitely not backporting.

Closes scylladb/scylladb#29114

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
  cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
  cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
  cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
  cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
  cql3: statement_restrictions: use predicate vector size for clustering prefix length
  cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
  cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
  cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
  cql3: statement_restrictions: add predicate-based index support checking
  cql3: statement_restrictions: use pre-built single-column maps for index support checks
  cql3: statement_restrictions: build clustering-prefix restrictions incrementally
  cql3: statement_restrictions: build partition-range restrictions incrementally
  cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
  cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
  cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
  cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
  cql3: statement_restrictions: track has-token state incrementally
  cql3: statement_restrictions: track partition-key-empty state incrementally
  cql3: statement_restrictions: track first multi-column predicate incrementally
  cql3: statement_restrictions: track last clustering column incrementally
  cql3: statement_restrictions: track clustering-has-slice incrementally
  cql3: statement_restrictions: track has-multi-column-clustering incrementally
  cql3: statement_restrictions: track clustering-empty state incrementally
  cql3: statement_restrictions: replace restr bridge variable with pred.filter
  cql3: statement_restrictions: convert single-column branch to use predicate properties
  cql3: statement_restrictions: convert multi-column branch to use predicate properties
  cql3: statement_restrictions: convert constructor loop to iterate over predicates
  cql3: statement_restrictions: annotate predicates with operator properties
  cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
  cql3: statement_restrictions: complete preparation early
  cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
  cql3: statement_restrictions: refine possible_lhs_values() function_call processing
  cql3: statement_restrictions: return nullptr for function solver if not token
  cql3: statement_restrictions: refine possible_lhs_values() subscript solving
  cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
  cql3: statement_restrictions: convert possible_lhs_values into a solver
  cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
  cql3: statement_restrictions: refactor IS NOT NULL processing
  cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
  cql3: statement_restrictions: fold add_is_not_restriction() into its caller
  cql3: statement_restrictions: fold add_restriction() into its caller
  cql3: statement_restrictions: remove possible_partition_token_values()
  cql3: statement_restrictions: remove possible_column_values
  cql3: statement_restrictions: pass schema to possible_column_values()
  cql3: statement_restrictions: remove fallback path in solve()
  cql3: statement_restrictions: reorder possible_lhs_column parameters
  cql3: statement_restrictions: prepare solver for multi-column restrictions
  cql3: statement_restrictions: add solver for token restriction on index
  cql3: statement_restrictions: pre-analyze column in value_for()
  cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
  cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
  cql3: statement_restrictions: adjust signature of range_from_raw_bounds
  cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
  cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
  cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
  cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
  cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
  cql3: statement_restrictions: wrap value_for_index_partition_key()
  cql3: statement_restrictions: hide value_for()
  cql3: statement_restrictions: push down clustering prefix wrapper one level
  cql3: statement_restrictions: wrap functions that return clustering ranges
  cql3: statement_restrictions: do not pass view schema back and forth
  cql3: statement_restrictions: pre-analyze token range restrictions
  cql3: statement_restrictions: pre-analyze partition key columns
  cql3: statement_restrictions: do not collect subscripted partition key columns
  cql3: statement_restrictions: split _partition_range_restrictions into three cases
  cql3: statement_restrictions: move value_list, value_set to header file
  cql3: statement_restrictions: wrap get_partition_key_ranges
  cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
  test: statement_restrictions: add index_selection regression test
2026-04-21 15:44:06 +03:00
Łukasz Paszkowski
d18eb9479f cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count
is set to 0, when tablets are enabled and the replication strategy
is NetworkReplicationStrategy.

This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.

The patch sets a default initial tablets count to zero regardless of
the chosen replication strategy. Then each of the replication strategy
validates the options and raises a configuration exception when tablets
are not supported.

All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`

Fixes https://github.com/scylladb/scylladb/issues/25340

Closes scylladb/scylladb#25342
2026-04-20 17:57:38 +03:00
Avi Kivity
325497d460 cql3: statement_restrictions: hide value_for()
value_for() is a general function that solves for values that
satisfy an expression set to TRUE. This goes against our goal to
prepare solvers for all the expressions we use. Fortunately, it's only
called with one expression, which comes from statement_restrictions, so
we can add an accessor that provides the expression from our own state.
Later, we'll be able to do prepare-time work on it.
2026-04-19 20:57:04 +03:00
Avi Kivity
620df7103f cql3: statement_restrictions: do not pass view schema back and forth
For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.

Avoid the back-and-forth and use the stored value. Using a different
value would be broken.

This change allows unifying the signatures of the four functions that
get clustering ranges.
2026-04-19 20:57:03 +03:00
Avi Kivity
eec0b20dbc cql3: statement_restrictions: prepare statement_restrictions for capturing this
Prevent copying/moving, that can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.

To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
2026-04-19 20:57:03 +03:00
Pawel Pery
7883f161bb vector-store: fix creating local vector search indexes with a part of the partition key
Users ought to have possibility to create the local index for Vector Search
based only on a part of the partition key. This commits provides this by
removing requirements of 'full partition key only' for custom local index.

The commit updates docs to explain that local vector index can use only a part
of the partition key.

The commit implements cqlpy test to check fixed functionality.

Fixes: SCYLLADB-953

Needs to be backported to 2026.1 as it is a fix for local vector indexes.

Closes scylladb/scylladb#28931
2026-04-17 11:44:15 +02:00
Aleksandra Martyniuk
38bad5f316 cql3: allow changing RF by more than one when adding or removing a DC
rf_rack_valid_keyspaces relies on the fact that replicas of base
table and mv are streamed concurrently. This is no longer true
for newly introduced method of adding a DC. Disable rf_rack_valid_keyspaces
in test_mv_first_replica_in_dc to force the old method.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
72bb3113ac db: add columns to system_schema.keyspaces
Add a new next_replication column to system_schema.keyspaces table.

While there is an ongoing RF change:
- next_replication keeps the target RF values;
- existing replication_v2 column keeps initial RF values - the ones we
  started the RF change with.

DESCRIBE KEYSPACE statement shows replication_v2.

When there is no ongoing RF change for this keyspace, its
next_replication is empty.

In this commit no data is kept in the new column.
2026-04-17 09:58:07 +02:00
Pavel Emelyanov
335261f351 cql3: Move enable_create_table_with_compact_storage to cql_config
Move enable_create_table_with_compact_storage option from db::config to
cql_config. This improves separation of concerns by consolidating CQL-specific
table creation policies in the cql_config structure. Update the CREATE TABLE
statement prepare() function to use the new location for the configuration check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:20 +03:00
Pavel Emelyanov
f20ede79f9 cql3: Move strict_is_not_null_in_views to cql_config
Move strict_is_not_null_in_views option from db::config to cql_config via
new view_restrictions sub-struct. This improves separation of concerns by
keeping view-specific validation policies with other CQL configuration.
Update prepare_view() to take view_restrictions reference instead of reaching
into db::config, and update all callsites to pass the sub-struct.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:19 +03:00
Pavel Emelyanov
027c91f45e cql3: Move restrict_future_timestamp to cql_config
Move restrict_future_timestamp option from db::config to cql_config. This
improves separation of concerns as timestamp validation is part of CQL query
execution behavior. Update validate_timestamp() function signature to take
cql_config reference instead of db::config, and update all callsites in
modification_statement and batch_statement to pass cql_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:53 +03:00
Pavel Emelyanov
7264581881 cql3: Move TWCS restriction options to cql_config
Move twcs_max_window_count and restrict_twcs_without_default_ttl options
from db::config to cql_config via new twcs_restrictions sub-struct. This
improves separation of concerns by keeping TWCS-specific validation policies
with other CQL configuration. Update check_restricted_table_properties()
to remove unused db parameter and take twcs_restrictions reference instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:52 +03:00
Pavel Emelyanov
8b853505cd cql3: Move keyspace restriction options to cql_config
Introduce replication_restrictions, a sub-struct of cql_config, to hold
the seven keyspace-level policy options that govern how CREATE/ALTER
KEYSPACE statements are validated:
  - restrict_replication_simplestrategy
  - replication_strategy_warn_list / replication_strategy_fail_list
  - minimum/maximum_replication_factor_warn/fail_threshold

Pass replication_restrictions into check_against_restricted_replication_strategies()
instead of having it reach into db::config directly (via both
qp.db().get_config() and qp.proxy().data_dictionary().get_config()).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:24 +03:00
Pavel Emelyanov
1af26a1dd6 cql3: Move batch_size_fail_threshold_in_kb to cql_config
The batch_size_fail_threshold_in_kb option controls the batch size at
which an oversized batch error is returned to the client. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4d255cf533 cql3: Move batch_size_warn_threshold_in_kb to cql_config
The batch_size_warn_threshold_in_kb option controls the batch size at
which a client warning is emitted during batch execution. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
a3f097f100 cql3: Move enable_parallelized_aggregation to cql_config
The enable_parallelized_aggregation option controls whether aggregation
queries are fanned out across shards for parallel execution. It belongs
in cql_config rather than db::config as it directly governs CQL query
behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4314fc0642 cql3: Move strict_allow_filtering to cql_config
The strict_allow_filtering option controls whether queries that require
ALLOW FILTERING are silently accepted, warned about, or rejected. It
belongs in cql_config rather than db::config as it directly governs CQL
query behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
3411ed8bcc cql3: Move select_internal_page_size to cql_config
The select_internal_page_size option controls CQL query execution
behavior (internal paging for aggregate/filtered SELECTs) and belongs
in cql_config rather than being read directly from db::config at
execution time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
60a834d9fa cql3: Add cql_config parameter to parsed_statement::prepare()
Pass cql_config to prepare() so that statement preparation can use
CQL-specific configuration rather than reaching into db::config
directly.

Callers that use default_cql_config:
- db/view/view.cc: builds a SELECT statement internally to compute view
  restrictions, not in response to a user query
- cql3/statements/create_view_statement.cc: same -- parses the view's
  WHERE clause as a synthetic SELECT to extract restrictions
- tools/schema_loader.cc: offline schema loading tool, no runtime
  config available
- tools/scylla-sstable.cc: offline sstable inspection tool, no runtime
  config available

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:25 +03:00
Avi Kivity
59ec93b86b Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.

Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.

Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".

build/release/scylla perf-tablets:

Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)

Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full      407.69ms
partial     2.65ms
```

After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full      354.50ms
partial     2.19ms
```

Closes scylladb/scylladb#28459

* github.com:scylladb/scylladb:
  test: boost: tablets: Add test for merge with arbitrary tablet count
  tablets, database: Advertise 'arbitrary' layout in snapshot manifest
  tablets: Introduce pow2_count per-table tablet option
  tablets: Prepare for non-power-of-two tablet count
  tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
  tablets: Prepare resize_decision to hold data in decisions
  tablets: table: Make storage_group handle arbitrary merge boundaries
  tablets: Make stats update post-merge work with arbitrary merge boundaries
  locator: tablets: Support arbitrary tablet boundaries
  locator: tablets: Introduce tablet_map::get_split_token()
  dht: Introduce get_uniform_tokens()
2026-04-15 18:57:22 +03:00
Marcin Maliszkiewicz
53b6e9fda5 Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov
Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included.

Cleaning components inter-dependencies, not backporting

Closes scylladb/scylladb#29429

* github.com:scylladb/scylladb:
  test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
  describe_statement: Get cluster info from storage_service
  storage_service: Add describe_cluster() method
  query_processor: Expose storage_service accessor
2026-04-15 14:40:15 +03:00
Nadav Har'El
1eb8d170dd Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.

The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.

To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until it the new one is up and ready.

This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.

Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.

Fixes: VECTOR-610

Closes scylladb/scylladb#29407

* github.com:scylladb/scylladb:
  docs: document vector index metadata and duplicate handling
  test/cqlpy: cover vector index duplicate creation rules
  vector_index: allow multiple named indexes on one column
  vector_index: store `index_version` as creation timeuuid
2026-04-15 14:40:15 +03:00
Pavel Emelyanov
a428472e50 db: Remove redundant enable_logstor config option
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29427
2026-04-15 14:40:15 +03:00
Botond Dénes
87eb20ba33 Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.

Since f344bd0aaa, aggregate queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.

No backport: enhancement

Closes scylladb/scylladb#29422

* github.com:scylladb/scylladb:
  cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
  test: cluster: dtest: Fix double-counting of metrics
2026-04-15 14:40:15 +03:00
Tomasz Grabiec
50fbac6ea6 tablets: Introduce pow2_count per-table tablet option
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
2026-04-15 10:40:56 +02:00
Pavel Emelyanov
debfb147f5 describe_statement: Get cluster info from storage_service
Update cluster_describe_statement::describe() to retrieve cluster metadata
from storage_service::describe_cluster() instead of directly from db::config
or gossiper.

The storage_service provides a centralized API for accessing cluster metadata
(cluster_name, partitioner, snitch_name) that works in both normal and
maintenance modes, improving separation of concerns.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:33:06 +03:00
Alex
0f6d9ffd22 cql: expose stable result metadata for prepared LIST statements
Prepared LIST statements were not calculating metadata in PREPARE path, and sent empty string hash to client causing problematic behaviour where metadat_id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuse that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
2026-04-13 17:49:27 +03:00
Dawid Pawlik
63b782451e vector_index: allow multiple named indexes on one column
Allow creating multiple named vector indexes on the same column while
still rejecting duplicate unnamed ones.

`index_metadata::equals_noname()` now ignores `index_version`,
which is unique for every vector index creation, so duplicate detection
keeps working for unnamed vector indexes.

CREATE INDEX keeps using structural duplicate detection for regular
indexes and unnamed vector indexes, but named vector indexes are checked
by name only.

The explicit name check is also needed for IF NOT EXISTS when the same
index name already exists on a different table in the same keyspace,
because vector indexes have no backing view table to catch that case.
2026-04-13 15:04:59 +02:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Dawid Pawlik
2dd8eef38c vector_index: store index_version as creation timeuuid
Vector indexes currently store the base table schema version in
`index_version`. That value is name-based, not time-based,
so it does not represent when the index was created.

Store a timeuuid instead and change the relevant interfaces from
`table_schema_version` to `utils::UUID`. This is a prerequisite
for supporting multiple vector indexes on the same column where
the oldest index must be selected deterministically via routing
implemented in Vector Store.

Update the cqlpy tests to check the new semantics directly:
recreating the index changes `index_version`, while ALTER TABLE does not.
2026-04-10 13:05:21 +02:00
Piotr Dulikowski
32e3a01718 Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek
Motivation
----------

Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.

Description of solution
-----------------------

The situations we focus on are the following:

* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns

We handle each of them and provide validation tests.

Implementation strategy
-----------------------

1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.

Tests
-----

We provide tests that validate the correctness of the solution.

The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):

Before:
```
real    0m31.809s
user    1m3.048s
sys     0m21.812s
```

After:
```
real    0m34.523s
user    1m10.307s
sys     0m27.223s
```

The incremental differences in time can be found in the commit messages.

Fixes SCYLLADB-429

Backport: not needed. This is an enhancement to an experimental feature.

Closes scylladb/scylladb#28526

* github.com:scylladb/scylladb:
  service: strong_consistency: Abort state_machine::apply when aborting server
  service: strong_consistency: Abort ongoing operations when shutting down
  service: client_state: Extend with abort_source
  service: strong_consistency: Handle abort when removing Raft group
  service: strong_consistency: Abort Raft operations on timeout
  service: strong_consistency: Use timeout when mutating
  service: strong_consistency: Fix indentation
  service: strong_consistency: Enclose coordinator methods with try-catch
  service: strong_consistency: Crash at unexpected exception
  test: cluster: Extract default config & cmdline in test_strong_consistency.py
2026-04-10 11:11:21 +02:00
Tomasz Grabiec
88bea5aaf3 cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.

Since f344bd0aaa, aggreagte queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
2026-04-10 02:12:48 +02:00
Szymon Wasik
573def7cd8 cql: accept source_model option and show options in DESCRIBE
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.

ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.

Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
2026-04-09 17:20:03 +02:00
Szymon Wasik
80a2e4a0ab cql: add Cassandra SAI (StorageAttachedIndex) compatibility
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.

When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
  - org.apache.cassandra.index.sai.StorageAttachedIndex
  - StorageAttachedIndex
  - sai

SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.

The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.

Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().

Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
2026-04-09 17:20:03 +02:00
Dawid Mędrek
ad8a263683 service: strong_consistency: Abort ongoing operations when shutting down
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.

When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.

We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).

---

Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:

```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
    ss.local().drain_on_shutdown().get();
});
```

The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.

It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.

---

Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

After:
```
real    0m32.024s
user    1m10.751s
sys     0m23.871s
```

Individual runs of the added tests:

test_queries_when_shutting_down:
```
real    0m12.818s
user    0m36.726s
sys     0m4.577s
```

test_abort_forwarded_write_upon_shutdown:
```
real    0m12.930s
user    0m36.622s
sys     0m4.752s
```
2026-04-09 11:36:17 +02:00
Dawid Mędrek
2243e0ffea service: strong_consistency: Use timeout when mutating
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.

Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
2026-04-09 11:25:57 +02:00
Karol Nowacki
6bc88e817f vector_search: fix SELECT on local vector index
Queries against local vector indexes were failing with the error:
"ANN ordering by vector requires the column to be indexed using 'vector_index'"

This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.

Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have  different target format.

This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.

This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.

Fixes: SCYLLADB-895
2026-03-30 16:46:48 +02:00