Commit Graph

4215 Commits

Author SHA1 Message Date
Botond Dénes
555cfbcd38 Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity
Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads.
- schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.

No backport: the Seastar commit that deprecated these functions hasn't made (and won't make) its way into any release branches (and the warnings are cosmetic anyway).

Closes scylladb/scylladb#29990

* github.com:scylladb/scylladb:
  treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
  scylla-gdb: read shard count from smp::_this_smp instead of smp::count
  schema_builder: make shard_count an explicit constructor parameter
2026-05-27 09:42:06 +03:00
Avi Kivity
8010e408a2 treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
Replace all uses of the deprecated seastar::smp::count with
this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards()
across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call
sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default
  parameter value. This is safe since all callers are on reactor threads,
  but the expression is now evaluated at each call site rather than being
  a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh,
  ent/encryption/encryption.cc: used in default member initializers and
  constructor member-init-lists. Objects are always constructed on reactor
  threads.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
2026-05-26 17:35:20 +03:00
Avi Kivity
f165b396fd schema_builder: make shard_count an explicit constructor parameter
A recent Seastar update deprecated smp::count and introduced
this_smp_shard_count() as a replacement. One difference is that
this_smp_shard_count() wants to run on a reactor thread.

This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE)
that nevertheless use a schema, as the schema_builder constructor
references smp::count. If we replace it with this_smp_shard_count()
then it will crash when running without a reactor.

To fix, remove the implicit this_smp_shard_count() call from raw_schema's
constructor and require callers to pass shard_count explicitly to
schema_builder. This allows tests that don't run on a reactor thread
to construct schemas without crashing.

Production code and reactor-based tests pass this_smp_shard_count().
Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test,
wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test)
pass a fixed shard count of 1.

Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE)
but also contains one plain BOOST_AUTO_TEST_CASE
(test_empty_key_view_comparison) that constructs a schema_builder without
a reactor context. This test also receives a fixed shard count of 1.
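
The constraint described above can be modeled in a few lines. This is only a Python sketch of the idea, not the C++ code from the patch: `SchemaBuilder`, `NoReactorError`, `build_in_plain_test` and the module-level `_reactor_shard_count` are hypothetical stand-ins, and `this_smp_shard_count()` here merely imitates the "needs a reactor" behavior the message describes.

```python
class NoReactorError(RuntimeError):
    """Raised when a reactor-only function is called off a reactor thread."""


# Hypothetical stand-in for per-reactor state; None means "no reactor running".
_reactor_shard_count = None


def this_smp_shard_count():
    # Imitates the behavior described in the commit: it needs a reactor.
    if _reactor_shard_count is None:
        raise NoReactorError("this_smp_shard_count() called without a reactor")
    return _reactor_shard_count


class SchemaBuilder:
    # shard_count is a required argument; the constructor never asks the reactor.
    def __init__(self, keyspace, table, shard_count):
        if shard_count < 1:
            raise ValueError("shard_count must be positive")
        self.keyspace = keyspace
        self.table = table
        self.shard_count = shard_count


def build_in_plain_test():
    # A non-reactor test passes a fixed shard count of 1.
    return SchemaBuilder("ks", "cf", shard_count=1)
```

With the parameter explicit, the only code that touches the reactor is the caller that chooses to pass `this_smp_shard_count()`.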
2026-05-26 11:55:56 +03:00
Botond Dénes
6c3f104b67 cql3: store raw query string in utils::chunked_string
Read query as fragmented string from the input stream in
transport/server.cc, propagate it as such to query_processor::prepare()
and also store it as such in cql3::cql_statement::raw_cql_statement.

Unfortunately, the query still has to be linearized for parsing, as
ANTLR -- although it allows for a custom InputStream implementation --
plays pointer-arithmetic games with the pointers obtained from it, so
fragmented input cannot be used.
To amortize the cost of this linearization, the query string is
linearized through utils::reusable_buffer. The parser can be
invoked recursively; nested invocations linearize directly.

Still, this patch limits the places where the query is linearized to the
following:
* Parsing
* Audit
* Logs and error messages

So the normal query paths for queries that actually can get arbitrarily
large (UPDATE and INSERT) should only linearize the query temporarily
for parsing.
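
As a rough illustration of "linearize through a reusable buffer", here is a Python sketch. `ChunkedString` and `ReusableBuffer` are toy classes invented for this example; they only mimic the idea behind utils::chunked_string and utils::reusable_buffer and do not reflect their real interfaces.

```python
class ChunkedString:
    """Toy fragmented string: a list of byte chunks (illustrative only)."""

    def __init__(self, chunks):
        self.chunks = [bytes(c) for c in chunks]

    def __len__(self):
        return sum(len(c) for c in self.chunks)


class ReusableBuffer:
    """Grows on demand and is reused across calls to amortize allocation."""

    def __init__(self):
        self._buf = bytearray()
        self.grow_count = 0  # how many times a new backing buffer was allocated

    def linearize(self, chunked):
        n = len(chunked)
        if len(self._buf) < n:
            # Allocate only when the current buffer is too small.
            self._buf = bytearray(n)
            self.grow_count += 1
        pos = 0
        for c in chunked.chunks:
            self._buf[pos:pos + len(c)] = c
            pos += len(c)
        # The view is valid only until the next linearize() call.
        return memoryview(self._buf)[:n]
```

The returned view is temporary by design: the parser consumes it, and the fragmented form remains the long-lived representation.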
2026-05-26 09:08:06 +03:00
Botond Dénes
4af3359744 cql3/expr: use utils::chunked_string for untyped_constant::raw_text
This value can be a string or bytes literal, which can get very large in
rare cases. Use chunked storage to avoid large allocations.
2026-05-26 09:08:06 +03:00
Botond Dénes
597d4252dc types: abstract_type::from_string() switch to fragmented buffers (interface)
Change input: std::string_view -> utils::chunked_string_view.
Change return value: bytes -> managed_bytes.

This patch only changes the interface, with some to_bytes() sprinkled in
the internals to deal with recursive calls.
Internals will be updated in the next patch, to keep the churn of
updating callers separate from the actually important changes.
2026-05-26 09:08:06 +03:00
Nadav Har'El
96dd3121e7 Merge 'cql: rewrite CassIO SAI metadata index to regular secondary index' from Szymon Wasik
CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index:

```sql
CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table>
ON <keyspace>.<table> (ENTRIES(metadata_s))
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
```

ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with:

```
StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns
```

This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`).

CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern.

Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries:

- **Detection**: SAI class name + single `ENTRIES` target on a non-frozen `map` column
- **Rewrite**: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries)
- **Warning**: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI

The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication.
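
The detection and rewrite rules above can be sketched as a small predicate. This is a Python model under stated assumptions, not the actual C++ change: `IndexTarget`, `is_cassio_metadata_sai`, `rewrite_custom_class` and the warning text are hypothetical, and class-name matching is done by exact string comparison purely for simplicity.

```python
from dataclasses import dataclass

# The two spellings mentioned in this description.
SAI_CLASS_NAMES = {
    "org.apache.cassandra.index.sai.StorageAttachedIndex",
    "sai",
}


@dataclass
class IndexTarget:
    column: str
    kind: str          # e.g. "ENTRIES", "KEYS", "VALUES"
    column_type: str   # e.g. "map", "vector", "int"
    frozen: bool = False


def is_cassio_metadata_sai(custom_class, targets):
    """SAI class name + a single ENTRIES target on a non-frozen map column."""
    if custom_class not in SAI_CLASS_NAMES or len(targets) != 1:
        return False
    t = targets[0]
    return t.kind == "ENTRIES" and t.column_type == "map" and not t.frozen


def rewrite_custom_class(custom_class, targets):
    """Return (custom_class, warnings); clearing the class selects the regular path."""
    if is_cassio_metadata_sai(custom_class, targets):
        # Hypothetical wording; the real warning text is not quoted here.
        return None, ["SAI is not supported; a regular secondary index was created instead"]
    return custom_class, []
```

Anything that does not match the pattern keeps its custom class, so the existing rejection and vector-index paths are left to handle it.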

After this change, the CassIO schema setup succeeds on ScyllaDB:
- `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index
- The index is functional and can accelerate metadata filtering queries
- A CQL warning makes the rewrite transparent to operators
- SAI on non-vector, non-map-entries columns is still rejected as before
- Vector SAI indexes continue to be rewritten to `vector_index` as before

- `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name)
- `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias
- `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected

All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped).

Fixes: SCYLLADB-2113
Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added.

Closes scylladb/scylladb#29981

* github.com:scylladb/scylladb:
  cql: rewrite CassIO SAI metadata index to regular secondary index
  db/config: add enable_cassio_compatibility flag
2026-05-26 00:19:03 +03:00
Szymon Wasik
5ee339b11d cql: rewrite CassIO SAI metadata index to regular secondary index
When CassIO creates a SAI ENTRIES index on a map column,
ScyllaDB now rewrites it to a regular secondary index and emits
a CQL warning. This allows LangChain/CassIO applications to work
without DDL errors.

The rewrite is gated behind the enable_cassio_compatibility flag
(disabled by default).

Refs: SCYLLADB-2113
2026-05-25 15:11:43 +02:00
Avi Kivity
892f22f49c Merge 'cql: atomic add/subtract operations with LWT' from Nadav Har'El
ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results.

This patch makes it allowed to use such atomic add/subtract operations in **LWT** statements. For example

	UPDATE ... SET a = a - 7 IF a > 0

or

	UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (`a`) should be initialized before the update. The example `SET a = a + 1 IF a != NULL` will fail the condition if `a` is not set. A different request `SET a = a + 1 IF EXISTS` will just leave `a` unset if it's unset (NULL + 1 is NULL, this is SQL's null propagation rules).
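
The difference between the two conditions can be seen in a small single-threaded model. This Python sketch only illustrates the null-propagation semantics described above; it does not model atomicity or the real CAS machinery, and `lwt_update`, `a_is_set` and `if_exists` are names invented for the example.

```python
def add(a, b):
    # NULL (None) propagates through arithmetic: NULL + 1 is NULL.
    return None if a is None or b is None else a + b


def lwt_update(row, column, delta, condition):
    """Model of: UPDATE ... SET column = column + delta IF <condition>.

    Returns True if the update was applied, False if the condition failed.
    """
    if row is None or not condition(row):
        return False
    row[column] = add(row.get(column), delta)
    return True


def a_is_set(row):
    # Models "IF a != NULL".
    return row.get("a") is not None


def if_exists(row):
    # Models "IF EXISTS"; row existence is checked by lwt_update itself.
    return True
```

In this model `IF a != NULL` refuses to touch an unset column, while `IF EXISTS` applies the update but the column stays unset.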

These add/subtract operations are allowed on any numeric (integer or floating point) column.

The ability of LWT to fetch the old values of a column and use it to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT.

This series does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax already existed for counters, and we just allow the same syntax for non-counters. However, the series does add a bit of machinery that will allow us to easily support more general expressions in the future. In particular, this series implements the addition, subtraction, and unary-minus operators for expressions, and adds the machinery needed to run **any** expression in "SET a = expr()", using existing row values fetched by LWT.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568
Refs #22918 ("Support arithmetic operators"), SCYLLADB-1576 ("Decimal arithmetic operations OOM")

This is a new feature, so normally it would not be backported.

Closes scylladb/scylladb#29939

* github.com:scylladb/scylladb:
  cql: atomic add/subtract operations with LWT
  cql3: let constants::setter evaluate expressions using prefetched row data
  cql3/expr: add NEG unary operator for numeric negation
  cql3/expr: add SUB binary operator for numeric subtraction
  cql3/expr: add ADD binary operator for numeric addition
  types: add is_arithmetic() method for types
2026-05-25 14:27:33 +03:00
Piotr Dulikowski
3a5dd2e5be Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.

By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.

Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.

The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request
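
The mapping above is small enough to write out. The following Python sketch mirrors the three rules; the real parse_consistency_level() is C++ and its signature is not shown here, so `ReadType` and `InvalidRequest` are illustrative names and consistency levels are represented as plain strings.

```python
from enum import Enum


class ReadType(Enum):
    LINEARIZABLE = "linearizable"
    NON_LINEARIZABLE = "non_linearizable"


class InvalidRequest(Exception):
    """Raised for consistency levels that strong consistency does not accept."""


def parse_consistency_level(cl):
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return ReadType.LINEARIZABLE        # forwarded to the raft leader
    if cl in ("ONE", "LOCAL_ONE"):
        return ReadType.NON_LINEARIZABLE    # may return stale results
    raise InvalidRequest(f"unsupported consistency level: {cl}")
```

Keeping the mapping in one function means a later change to which levels are accepted touches a single place.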

Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.

Fixes: SCYLLADB-1157

Closes scylladb/scylladb#29575

* github.com:scylladb/scylladb:
  strong_consistency: test read forwarding to leader
  strong_consistency: skip read_barrier() for non-linearizable reads
  strong_consistency: split coordinator-level read latency metrics
  strong_consistency: forward linearizable reads to raft leader
  strong_consistency: classify reads by consistency level
  strong_consistency: add begin_read() to raft_server
2026-05-25 10:55:00 +02:00
Nadav Har'El
f8aaeb5e87 cql: atomic add/subtract operations with LWT
ScyllaDB has special counter columns for which atomic add/subtract
operations like `SET a = a + 1` are allowed. Such operations have not
been allowed on ordinary non-counter columns, as they would not be
properly atomic - the read and the write are separate, and concurrent
operations can have incorrect results.

This patch makes it allowed to use such atomic add/subtract operations
in *LWT* statements. Some examples:

        UPDATE ... SET a = a - 1 IF a > 0

        UPDATE ... SET a = a + 1 IF EXISTS

        UPDATE ... SET a = a + 1 IF a != NULL

The row updated in the operation, and the updated column (a) should
be initialized before the update - arithmetic operations on missing
column values silently leave the column null (no error is generated).

These add/subtract operations are allowed on any numeric column -
integer or floating point of any size.

The ability of LWT to fetch the old values of a column and use it to
calculate the new value has long been available in our internal CAS
implementation - and has been in use for years in Alternator - but until
this patch it was not exposed in CQL's LWT.

This patch does not add new syntax to CQL - the "SET a = a + b"
and "SET a = a - b" syntax that already existed for counters is now
allowed for non-counters.

This is a new Scylla-only feature that does not exist in Cassandra.

Fixes #10568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:09:11 +03:00
Nadav Har'El
3c6931c1ed cql3: let constants::setter evaluate expressions using prefetched row data
Previously, constants::setter evaluated its expression using only the query
options, which means expressions referencing row columns (column_value nodes)
would crash or return incorrect results.

Add evaluate_on_prefetched_row() to update_parameters: it evaluates an
expression in the context of the prefetched row for a given (pkey, ckey),
falling back to options-only evaluate() when no selection is available
(non-LWT context) or no column values are needed, and treating absent
columns needed by the expression as null.

Extend constants::setter to use this method:
- setter::execute() now calls evaluate_on_prefetched_row() or evaluate()
  as needed.
- setter::requires_read() returns true when the expression contains a
  column_value node, triggering a prefetch read.
- setter::requires_lwt() mirrors requires_read(), enforcing that column-
  referencing arithmetic is only allowed inside a conditional (IF) statement.
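
A toy model of the rules above, in Python. The node classes and the `evaluate()` helper are simplified inventions for illustration (only `+` and `-` are handled) and are not the cql3::expr types; only the names requires_read, requires_lwt and column_value come from the description.

```python
from dataclasses import dataclass


@dataclass
class Constant:
    value: object


@dataclass
class ColumnValue:
    name: str


@dataclass
class BinaryOp:
    op: str        # "+" or "-"
    lhs: object
    rhs: object


def contains_column_value(expr):
    if isinstance(expr, ColumnValue):
        return True
    if isinstance(expr, BinaryOp):
        return contains_column_value(expr.lhs) or contains_column_value(expr.rhs)
    return False


def requires_read(expr):
    # A column reference means the old row must be prefetched.
    return contains_column_value(expr)


def requires_lwt(expr):
    # Mirrors requires_read(): column-referencing arithmetic needs an IF.
    return requires_read(expr)


def evaluate(expr, prefetched_row=None):
    if isinstance(expr, Constant):
        return expr.value
    if isinstance(expr, ColumnValue):
        # Absent columns (or no prefetched row at all) are treated as null.
        return None if prefetched_row is None else prefetched_row.get(expr.name)
    lhs = evaluate(expr.lhs, prefetched_row)
    rhs = evaluate(expr.rhs, prefetched_row)
    if lhs is None or rhs is None:
        return None
    return lhs + rhs if expr.op == "+" else lhs - rhs
```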

We'll use this new feature to implement "SET r = r + 1" and similar
expressions in the next patch.
2026-05-25 10:09:11 +03:00
Nadav Har'El
b026aea6f7 cql3/expr: add NEG unary operator for numeric negation
This patch adds a new expression type, unary_operator, analogous to
the existing binary_operator but takes just one operand instead of
two.

This patch also implements the first and only unary operator type,
unary_oper_t::NEG, implementing negation (unary minus) for all numeric
types.

For fixed-width integer types overflow or underflow results in an error.
If the operand is NULL, the result is a NULL as well.

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We also do not plan to use it in the
following patch which uses the separate SUB (subtraction) operation,
not the new NEG. But since I already implemented a unary minus operator,
and we'll surely need it in the future for general arithmetic operations,
I thought I might as well include this patch as well.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:08:11 +03:00
Nadav Har'El
f27d1f08fc cql3/expr: add SUB binary operator for numeric subtraction
In this patch we add to our expressions oper_t::SUB, for subtraction,
analogous to the ADD from the previous patch.

The only reason why we need a separate SUB operation and can't just
combine ADD with a unary minus (NEG) operator is the minimum value of a
fixed-size integer type. For example, 8-bit integers have the range
-128...127. A subtraction like -1 - (-128) is valid (its value is 127)
but the negation of (-128) would be invalid (128). One of the tests
we add in this patch validates this fact.
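
The edge case can be checked with a few lines of range-checked arithmetic. This Python sketch models 8-bit checked operations only; `checked`, `neg`, `add` and `sub` are illustrative helpers, not the implementation in the patch.

```python
INT8_MIN, INT8_MAX = -128, 127


def checked(value):
    # Overflow and underflow are errors, not wraparound.
    if not INT8_MIN <= value <= INT8_MAX:
        raise OverflowError(f"{value} does not fit in an 8-bit integer")
    return value


def neg(a):
    return checked(-a)


def add(a, b):
    return checked(a + b)


def sub(a, b):
    return checked(a - b)


def sub_via_neg(a, b):
    # Emulating a - b as a + (-b) fails whenever -b itself overflows.
    return add(a, neg(b))
```

Here `sub(-1, -128)` yields 127, while `sub_via_neg(-1, -128)` raises because the intermediate value 128 is out of range.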

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-25 10:06:28 +03:00
Nadav Har'El
083adf84ab cql3/expr: add ADD binary operator for numeric addition
Extend oper_t with a new ADD operator, to represent addition between two
numeric expressions. Supports all numeric types - tinyint, smallint,
int, bigint, float, double, varint, and decimal.

For fixed-width integer types, overflow or underflow results in an error.
If one of the operands is NULL, the result is also NULL.

The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We plan to start using this new operator
in a following patch which implements counter syntax ("SET r = r + 1" )
for LWT, but in the future we can use it for more general cases.

At the moment, ADD requires that both operands have the same type.
This is all we need for the first use case, and this limitation can
be relaxed later.

Interestingly, ADD is our first binary operator implementation that
does not return a boolean. Until now all our binary operators have been
comparison operators, and all returned boolean. In contrast, ADD's
return type is the type of its operands.
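
The typing rules described so far (same operand types, NULL propagation, overflow as an error, result type equal to the operand type) can be summarized in a short Python model. `Value` and `add` are hypothetical names; only integer types are modeled, with varint treated as unbounded, and float, double and decimal are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Signed ranges of the fixed-width CQL integer types.
INTEGER_RANGES = {
    name: (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    for name, bits in (("tinyint", 8), ("smallint", 16), ("int", 32), ("bigint", 64))
}


@dataclass(frozen=True)
class Value:
    type: str
    data: Optional[int]   # None models NULL


def add(lhs, rhs):
    if lhs.type != rhs.type:
        raise TypeError("ADD requires both operands to have the same type")
    if lhs.data is None or rhs.data is None:
        return Value(lhs.type, None)          # NULL propagates
    result = lhs.data + rhs.data
    bounds = INTEGER_RANGES.get(lhs.type)     # None for varint (unbounded)
    if bounds is not None and not bounds[0] <= result <= bounds[1]:
        raise OverflowError(f"{lhs.type} overflow")
    return Value(lhs.type, result)            # result has the operands' type
```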

This implementation is susceptible to the pre-existing bug SCYLLADB-1576,
where adding 1e1000000 and 1 in "decimal" or "varint" types will
happily allocate a million-digit number and run out of memory. A
reproducing test is included, and this issue will be solved in one
place for all operations that have additions (including aggregations
and arithmetic expressions) in a followup pull-request.

Refs #22918 ("Support arithmetic operators")
2026-05-25 10:05:09 +03:00
Wojciech Mitros
c0ea98f922 strong_consistency: classify reads by consistency level
Introduce a read_type enum (linearizable vs non_linearizable) and transform
the existing "validate" function into a "parse" method - instead of checking
if the consistency level is one of the accepted ones, we now also return the
corresponding read type for strong consistency.
The "parse" function maps CQL consistency levels to the following read types:
- CL=(LOCAL_)QUORUM -> linearizable (this is the default CL)
- CL=(LOCAL_)ONE -> non_linearizable
- all others -> throw

The classification is performed in the CQL layer (select_statement) to
keep the coordinator free of CL concepts.
2026-05-23 11:35:37 +02:00
Avi Kivity
305346a3ec Merge 'Don't materialize collections into intermediate representations' from Botond Dénes
Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again.
This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections.

This PR aims to solve this problem once and for all.  The plan is as follows:
* Promote direct use of the serialized collection format:
    - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`.
    - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`).
* Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure.
* Drop the old infrastructure, to avoid accidental regressions.

Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion.

To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |   After | Change     |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps)  | 315,760 | 332,021 | **+5.1%**  |
| Instructions/op (median) |  53,776 |  48,681 | **-9.5%**  |
| CPU cycles/op (median)   |  17,365 |  16,471 | **-5.1%**  |
| Allocations/op           |    85.1 |    82.1 | **-3.5%**  |

**Significant improvement.** Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric                   |    Before |    After | Change    |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps)  |   150,823 |  149,678 | **-0.8%** |
| Instructions/op (median) |   108,388 |  103,858 | **-4.2%** |
| CPU cycles/op (median)   |    34,860 |   35,371 | **+1.5%** |
| Allocations/op           | ~105–108  | ~102–103 | **-3.0%** |

**Mixed, mostly neutral.** Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |  After | Change    |
|--------------------------|--------:|-------:|-----------|
| Throughput (median tps)  |  55,777 | 56,051 | **+0.5%** |
| Instructions/op (median) | 246,215 |246,610 | **+0.2%** |
| CPU cycles/op (median)   |  77,641 | 77,020 | **-0.8%** |
| Allocations/op           |   340.4 |  335.4 | **-1.5%** |

**Essentially neutral.** All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a **clear, substantial positive effect on reads** (~5% throughput gain, ~9.5% fewer instructions per op).
The write and alternator paths are **unaffected in practice** — changes there are within measurement noise. No regressions are apparent.
This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602

Improvement, no backport.

Closes scylladb/scylladb#29127

* github.com:scylladb/scylladb:
  mutation/collection_mutation: make collection_mutation::_data private
  mutation_collection: drop collection_mutation_description and friends
  test: move away from collection_mutation_description
  tree: move away from collection_mutation_description
  test: move away from collection_mutation_view::with_deserialized()
  tree: move away from collection_mutation_view::with_deserialized()
  types: fix indendation, left broken by previous commit
  types: move away from collection_mutation_view::with_deserialized()
  types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT()
  schema: column_computation: move away from collection_mutation_view::with_deserialized()
  mutation: move away from collection_mutation_view::with_deserialized()
  alternator: move away from collection_mutation_view::with_deserialized()
  cdc: move away from collection_mutation_view::with_deserialized()
  mutation/collection_mutation: printer: don't deserialize collections
  mutation/collection_mutation: difference(): don't deserialize collections
  mutation/collection_mutation: merge(): don't deserialize collections
  mutation/collection_mutation: extract compact_and_expire() to free function
  mutation/collection_mutation: refactor empty(), is_any_live() and last_update()
  compaction_garbage_collector: pass collection_mutation to collect()
  test/boost/mutation_test: add tests for collection_mutation_{view,writer}
  mutation/collaction_mutation: collection_mutation_view: add methods to inspect content
  mutation/collection_mutation: add collection_mutation_writer
  mutation/collection_mutation: collection_mutation(): generate valid collection
  mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
  mutation/atomic_cell: drop unused type param from from_bytes()
2026-05-21 17:10:40 +03:00
Wojciech Mitros
13c043903d strong_consistency: cache leader location for non-replica nodes
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.

Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.

The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.

Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such a loop
is broken as soon as the tablet maps are updated.

On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.
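
The cache behavior described so far (record after a forward, validate at use time, evict on group removal) can be sketched as follows. This is a Python model with invented names (`LeaderCache`, `pick_write_target`); nodes and groups are plain strings, and nothing here reflects the actual groups_manager interface.

```python
class LeaderCache:
    """Maps raft group_id -> last known leader node (illustrative model)."""

    def __init__(self):
        self._leaders = {}

    def record(self, group_id, node):
        # Called once the final node of a forwarded write is known.
        self._leaders[group_id] = node

    def lookup(self, group_id, replicas):
        node = self._leaders.get(group_id)
        if node is None:
            return None
        if node not in replicas:
            # Use-time validation: a cached leader that is no longer a
            # replica (e.g. after tablet migration) is evicted.
            del self._leaders[group_id]
            return None
        return node

    def on_topology_update(self, live_groups):
        # Drop entries only for groups that no longer exist.
        for group_id in list(self._leaders):
            if group_id not in live_groups:
                del self._leaders[group_id]


def pick_write_target(cache, group_id, replicas, closest_replica):
    cached = cache.lookup(group_id, replicas)
    return cached if cached is not None else closest_replica
```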

An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.

We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.

Fixes: SCYLLADB-1064

Closes scylladb/scylladb#29392
2026-05-21 10:32:56 +02:00
Botond Dénes
636e2877e2 tree: move away from collection_mutation_description
Use collection_mutation_writer instead.

Add to_managed_bytes() to cql3::raw_value to help avoid some copies.

A special note for sstables/kl/reader.cc: this conversion is not
straightforward, so we accumulate a list of cells and feed them to the
writer at the end. This is sub-optimal but this code is rarely used, best
to be conservative.
2026-05-21 10:23:29 +03:00
Botond Dénes
24fdfa34dd mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
2026-05-21 08:34:21 +03:00
Dawid Pawlik
4c2ce1928c types/vector: avoid unnecessary copies during vector reserialization
When reserialize_value() is called on a vector type (which happens only
when the vector's element type contains sets or maps), the old code
materialized all elements via split_fragmented() into a
std::vector<managed_bytes>, then iterated them calling
reserialize_value() on each — discarding the intermediate copy.

Use split_fragmented_view() to obtain zero-copy views of elements, and
pass those directly to reserialize_value(). This avoids one managed_bytes
allocation per element.

Additionally, wrap the call with with_simplified() so that when the
input is a single contiguous fragment (the common case), the compiler
receives a single_fragmented_view and can eliminate fragment-boundary
checks at compile time.

Also generalize build_value_fragmented() to accept any forward range of
FragmentedView elements (not just managed_bytes), and write directly
into the output buffer via with_linearized instead of going through an
intermediate read_simple_bytes copy. This benefits all callers including
evaluate_vector() on the INSERT path for vector<float, N>.

The with_simplified() dispatch instantiates reserialize_value with
single_fragmented_view, which in turn instantiates
partially_deserialize_listlike and partially_deserialize_map with that
type. Add explicit template instantiations in types/types.cc since those
function templates are defined there and only previously instantiated for
managed_bytes_view and fragmented_temporary_buffer::view.

Note: the reserialization path is only exercised for vectors whose
element type contains sets or maps (e.g. vector<frozen<map<int,int>>, N>).
The common vector<float, N> case never enters reserialize_value() because
bound_value_needs_to_be_reserialized() returns false at the call site.
However, the build_value_fragmented() improvement applies to all vector
INSERTs.

References: SCYLLADB-471
Fixes: SCYLLADB-1799

Closes scylladb/scylladb#28559
2026-05-20 12:22:19 +03:00
Dawid Pawlik
232b1a3725 cql3: generalize viewless index handling in CREATE INDEX statement
Replace the `vector_index`-specific checks in `create_index_statement`
with a generic `is_viewless_custom_class()` helper that queries the
index factory to determine whether an index type creates a backing
materialized view.

This covers both existing (`vector_index`) and new (`fulltext_index`)
viewless index types:
- Reject view properties (WITH clause) for any viewless index
- Use name-based duplicate detection for named viewless indexes,
  since they have no backing view table for `has_schema()` to find
  (issue #26672)
2026-05-19 08:52:47 +02:00
Dawid Pawlik
9e02e11ea8 fulltext_index: enforce CDC requirements for fulltext indexes
Fulltext indexes rely on CDC to track changes for asynchronous index
building. Enforce the following CDC constraints during CREATE INDEX:
- CDC TTL must be at least 86400 seconds (24 hours)
- CDC delta mode must be 'full' or postimage must be enabled

Add `has_fulltext_index()` and `check_cdc_options()` so that other
modules can detect fulltext indexes and validate CDC settings:
- include fulltext indexes in `cdc_enabled()` so the CDC log
  is auto-created, and validate CDC options in
  `on_before_update_column_family()`
- block `ALTER TABLE ... WITH cdc = {'enabled': false}`
  when a fulltext index exists on the table
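
The two constraints listed above can be sketched as a small validation function. This is only an illustration: `cdc_options_sketch`, `delta_mode`, and `check_cdc_options_sketch` are hypothetical names, the error strings are invented, and the sketch applies the TTL bound literally without modelling any special TTL values.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified view of the CDC options relevant here.
enum class delta_mode { keys, full };

struct cdc_options_sketch {
    int ttl_seconds = 86400;
    delta_mode delta = delta_mode::full;
    bool postimage = false;
};

// 24 hours, the minimum stated in the commit message.
constexpr int min_cdc_ttl_seconds = 86400;

// Returns an error message when the options violate a constraint,
// or nullopt when they are acceptable.
inline std::optional<std::string> check_cdc_options_sketch(const cdc_options_sketch& opts) {
    if (opts.ttl_seconds < min_cdc_ttl_seconds) {
        return "CDC TTL must be at least " + std::to_string(min_cdc_ttl_seconds) + " seconds";
    }
    if (opts.delta != delta_mode::full && !opts.postimage) {
        return "CDC delta mode must be 'full' or postimage must be enabled";
    }
    return std::nullopt;
}

// Usage example: the defaults satisfy both constraints, a one-hour TTL does not.
inline void example() {
    cdc_options_sketch ok;
    assert(!check_cdc_options_sketch(ok).has_value());
    cdc_options_sketch short_ttl;
    short_ttl.ttl_seconds = 3600;
    assert(check_cdc_options_sketch(short_ttl).has_value());
}
```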
2026-05-19 08:52:47 +02:00
Dawid Pawlik
61d658106a index: introduce external_index base class for VS/FTS indexes
Add `external_index` as a common base for `vector_index` and `fulltext_index`,
both of which are backed by an external Vector Store engine and share CDC
requirements.
2026-05-19 08:52:47 +02:00
Petr Gusev
9e3209e4a3 cql: refactor add_tablet_info to take tablet_routing_info directly
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.

Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.
2026-05-15 12:28:33 +02:00
Petr Gusev
738b7b4a86 cql: fix UB dereference of nullopt tablet_info in execute_with_condition
When check_locality() returns nullopt (correctly routed LWT), the
optional tablet_info was unconditionally dereferenced in the lambda
capture list: tablet_info->tablet_replicas, tablet_info->token_range.

The code previously masked this by initializing tablet_info with an
empty-but-present value, so the dereference happened to work but
only because the empty tablet_replicas made add_tablet_info() a no-op.
After check_locality() overwrites it with nullopt, the dereference
is UB.

Fix by initializing tablet_info as empty (nullopt) and guarding the
dereference.
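
A minimal sketch of the guarded pattern, assuming a simplified `routing_info` struct in place of `locator::tablet_routing_info` and a toy `check_locality()`; both are hypothetical and only illustrate why the `std::optional` must be tested before it is dereferenced.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the routing info; field names are illustrative.
struct routing_info {
    std::vector<int> replicas;
    std::string token_range;
};

// Toy check_locality(): nullopt means the request was routed correctly
// and there is nothing to report back to the client.
inline std::optional<routing_info> check_locality(bool correctly_routed) {
    if (correctly_routed) {
        return std::nullopt;
    }
    return routing_info{{1, 2, 3}, "(0, 100]"};
}

// Builds a payload only when routing info is present. Reading
// info->replicas without the check is undefined behaviour for nullopt.
inline std::string make_payload(const std::optional<routing_info>& info) {
    if (!info) {
        return {};
    }
    // Present info always carries replicas: a tablet always has some.
    assert(!info->replicas.empty());
    return "replicas=" + std::to_string(info->replicas.size()) + " range=" + info->token_range;
}
```

Calling `make_payload(check_locality(true))` yields an empty string instead of touching a disengaged optional.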
2026-05-15 11:56:14 +02:00
Petr Gusev
167a3c9c50 cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041
2026-05-15 11:56:14 +02:00
Piotr Dulikowski
f3ac35f9d2 Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807

no backport - strong consistency is still experimental

Closes scylladb/scylladb#28843

* github.com:scylladb/scylladb:
  strong_consistency: wait for leader when starting a group
  strong_consistency: change wait for groups to start on startup
  strong_consistency: optimize wait_for_groups_to_start
  strong_consistency: wait for raft servers to start in create table
2026-05-13 16:42:05 +02:00
Piotr Dulikowski
dc05bd35bb Merge 'strong_consistency: limit available consistency levels in strong consistent requests' from Michał Jadwiszczak
Strongly consistent requests take a different path than EC requests, and consistency levels don't map well onto it.
We should limit the available consistency levels in SC requests to avoid silently ignoring them, which may confuse users.

For writes, there is only one option:
- QUORUM/LOCAL_QUORUM (multi-DC is not supported yet, so both of these CLs have the same effect) - a quorum of replicas is needed to successfully commit new mutations to the Raft log.

For reads, there are 2 options:
- QUORUM/LOCAL_QUORUM - if the user wants to be sure to see the latest data; the query then needs to execute `read_barrier()`, which requires a quorum of replicas
- ONE/LOCAL_ONE - if the user just wants to read data from one replica without synchronization
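
The allowed levels listed above can be summarized as a small set of checks. This is a sketch with a deliberately reduced `consistency_level` enum; the function names are hypothetical and are not taken from the ScyllaDB sources.

```cpp
#include <cassert>

// Reduced, illustrative subset of CQL consistency levels.
enum class consistency_level { one, quorum, all, local_quorum, local_one };

constexpr bool is_quorum_like(consistency_level cl) {
    return cl == consistency_level::quorum || cl == consistency_level::local_quorum;
}

// Writes must commit to the Raft log, which needs a quorum of replicas.
constexpr bool sc_write_cl_allowed(consistency_level cl) {
    return is_quorum_like(cl);
}

// Reads may either synchronize through a read barrier (quorum-like levels)
// or read from a single replica without synchronization.
constexpr bool sc_read_cl_allowed(consistency_level cl) {
    return is_quorum_like(cl) || cl == consistency_level::one || cl == consistency_level::local_one;
}

// Only the quorum-like read path runs read_barrier() first.
constexpr bool sc_read_needs_barrier(consistency_level cl) {
    return is_quorum_like(cl);
}

// Usage example: any other level is rejected instead of silently ignored.
inline void example() {
    assert(sc_write_cl_allowed(consistency_level::local_quorum));
    assert(!sc_write_cl_allowed(consistency_level::one));
    assert(!sc_read_cl_allowed(consistency_level::all));
}
```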

All tests were updated to use LOCAL_QUORUM for both reads and writes.

Fixes SCYLLADB-1766

SC is in experimental phase and this patch is an improvement, no backport needed.

Closes scylladb/scylladb#29691

* github.com:scylladb/scylladb:
  strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
  strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
2026-05-13 16:31:05 +02:00
Michael Litvak
5a5c7c6241 strong_consistency: wait for raft servers to start in create table
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807
2026-05-13 08:43:24 +02:00
Michał Jadwiszczak
d073097ebf strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
We can execute strong consistent read queries in 2 ways:
- with QUORUM/LOCAL_QUORUM CL - this path executes `read_barrier()`
  before reading the data, which synchronizes the Raft log with the leader.
  Executing it requires a quorum of replicas
- with ONE/LOCAL_ONE CL - this path just reads data from one replica
  without any synchronization (not implemented yet)
2026-05-12 23:20:07 +02:00
Michał Jadwiszczak
68f0cf6fac strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
To successfully write data to a strongly consistent table, a quorum of
replicas must save the data to the Raft log.

So the only reasonable consistency level is QUORUM/LOCAL_QUORUM
(currently SC doesn't support multi DC).
2026-05-12 23:20:03 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Avi Kivity
f5ffbd3c3e cql3: restrictions: reindent statement_restrictions.cc
6165124fcc has left statement_restrictions.cc scarred and
deformed. Restore it to standard 4-space indentation. This patch
contains only whitespace changes.

Closes scylladb/scylladb#29598
2026-05-11 17:02:14 +03:00
Piotr Smaron
959f67b345 cql: verify tuples length in multi-column IN restriction
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.

Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.
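
The arity check can be sketched as follows. The function name `tuple_arity_error` and the use of `int` tuple elements are assumptions made for illustration; only the error text is taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Returns the error for the first tuple whose size differs from the number
// of restricted columns, or nullopt when every tuple has the right arity.
inline std::optional<std::string> tuple_arity_error(
        std::size_t restricted_columns,
        const std::vector<std::vector<int>>& tuples) {
    for (const auto& t : tuples) {
        if (t.size() != restricted_columns) {
            return "Expected " + std::to_string(restricted_columns)
                + " elements in value tuple, but got " + std::to_string(t.size());
        }
    }
    return std::nullopt;
}

// Usage example mirroring (b, c, d) IN ((1, 2), (2, 1, 4)).
inline void example() {
    std::vector<std::vector<int>> tuples{{1, 2}, {2, 1, 4}};
    auto err = tuple_arity_error(3, tuples);
    assert(err.has_value());
}
```

Under-sized and over-sized tuples go through the same comparison, so both produce the same form of message.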

Fixes #13241

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#18407
2026-05-11 16:55:09 +03:00
Nadav Har'El
fcfad51284 Merge 'cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time' from Marcin Maliszkiewicz
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756
Backport: no, it's a minor fix and UDFs are an experimental feature in Scylla

Closes scylladb/scylladb#29717

* github.com:scylladb/scylladb:
  test/cqlpy: add test for EXECUTE permission on UDA sub-functions
  cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
2026-05-11 16:14:38 +03:00
Yaniv Kaul
cfb568b5b5 cql3: fix missing format placeholders in error messages
Fix two format string bugs where arguments were silently dropped:

- prepare_expr.cc: the bad argument to count() was passed but had no
  {} placeholder, so users never saw what was actually passed.
- statement_restrictions.cc: the unsupported multi-column relation was
  passed but the trailing colon had no {} placeholder.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Marcin Maliszkiewicz
fb55bef0ac cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
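
A simplified sketch of the idea: every function that may run on behalf of the aggregate is reported, including the optional reduce function. `aggregate_sketch` and its string fields are hypothetical; the real code works with function objects rather than names.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical description of a user-defined aggregate.
struct aggregate_sketch {
    std::string name;
    std::string sfunc;
    std::optional<std::string> finalfunc;
    std::optional<std::string> reducefunc;
};

// Lists every function that needs an EXECUTE check. Before the fix
// described above, the reducefunc entry would have been missing.
inline std::vector<std::string> used_functions_sketch(const aggregate_sketch& agg) {
    std::vector<std::string> out{agg.name, agg.sfunc};
    if (agg.finalfunc) {
        out.push_back(*agg.finalfunc);
    }
    if (agg.reducefunc) {
        out.push_back(*agg.reducefunc);
    }
    return out;
}

// Usage example: an aggregate with all optional parts reports four functions.
inline void example() {
    aggregate_sketch agg{"my_avg", "accumulate", std::string("finalize"), std::string("merge")};
    assert(used_functions_sketch(agg).size() == 4);
}
```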

Fixes SCYLLADB-1756
2026-05-08 16:37:52 +02:00
Dario Mirovic
918130befd utils: loading_cache: add insert() that is a no-op when caching is disabled
When the cache is constructed with expiry == 0 the underlying storage is
never instantiated and get_ptr() asserts via caching_enabled(). This is
fine for callers that need a handle into the cache, but it makes get_ptr()
unusable for write-only insertions on caches whose expiry is configurable
at runtime (e.g. caches driven by a LiveUpdate config option that the
operator may set to 0).

Add a new insert(k, load) method on loading_cache that returns a future<>
and is a no-op when caching is disabled, otherwise forwards to
get_ptr(k, load) and discards the resulting handle. This completes the
disabled-mode safety contract of the cache for the write side, mirroring
the fallback that get() already provides for the read side.

Switch authorized_prepared_statements_cache::insert() from
get_ptr().discard_result() to the new insert(), which fixes the crash
'Assertion caching_enabled() failed' in
authorized_prepared_statements_cache::insert() that occurs when
permissions_validity_in_ms is set to 0 and a prepared statement is
executed under authentication.
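
The contract can be illustrated with a much-reduced, synchronous cache. `loading_cache_sketch` is hypothetical: it stores `int` values, returns references instead of handles, and has no futures or expiry logic, unlike the real `loading_cache`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

class loading_cache_sketch {
    std::chrono::milliseconds _expiry;
    std::map<std::string, int> _storage;
public:
    using loader = std::function<int(const std::string&)>;

    explicit loading_cache_sketch(std::chrono::milliseconds expiry) : _expiry(expiry) {}

    bool caching_enabled() const {
        return _expiry != std::chrono::milliseconds::zero();
    }

    // Callers want a handle into the cache, so a disabled cache is treated
    // as a precondition violation.
    const int& get_ptr(const std::string& key, const loader& load) {
        assert(caching_enabled());
        auto it = _storage.find(key);
        if (it == _storage.end()) {
            it = _storage.emplace(key, load(key)).first;
        }
        return it->second;
    }

    // Write-only insertion: a no-op when caching is disabled, otherwise
    // loads through get_ptr() and discards the result.
    void insert(const std::string& key, const loader& load) {
        if (!caching_enabled()) {
            return;
        }
        get_ptr(key, load);
    }

    std::size_t size() const { return _storage.size(); }
};
```

With an expiry of zero, `insert()` returns without touching the storage, so callers that only populate the cache no longer trip the assertion in `get_ptr()`.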

Fixes SCYLLADB-1699
2026-04-30 16:51:23 +02:00
Marcin Maliszkiewicz
3df951bc9c Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652

No backport - bug introduced recently.

Closes scylladb/scylladb#29570

* github.com:scylladb/scylladb:
  test/audit: add reproducer for native-protocol batch not being audited
  audit: set audit_info for native-protocol BATCH messages
  test/audit: rename internal test methods to avoid CI misdetection
2026-04-22 18:56:28 +02:00
Michał Jadwiszczak
f77c258c8e strong_consistency: wire up metrics to operations
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().

Count commit_status_unknown errors in coordinator::mutate().

Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
2026-04-22 08:59:59 +02:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack-list replication factor.

In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change, the request is added to ongoing_rf_changes, a new column in the system.topology table. The target RF is kept in next_replication, a new column in system_schema.keyspaces.

In make_rf_change_plan, the load balancer schedules the necessary migrations, considering node load and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently and independently of one another, and within each request racks are processed concurrently. No tablet replica is removed until all required replicas have been added. When adding replicas to a rack, base tables are handled first and views only after they are done; removal proceeds in the opposite order. The intermediate steps aren't reflected in the schema. When the RF change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.

If a request hasn't started removing replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. The load balancer interprets this and starts rolling the request back. After the rollback is done, the relevant system.topology_requests entry is marked as done (failed), the request id is cleared from system.topology::ongoing_rf_changes, and next_replication is removed.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
Andrzej Jackowski
f5bb9b6282 audit: set audit_info for native-protocol BATCH messages
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Nadav Har'El
6165124fcc Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.

Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
series reduces them to 2 visits (one nested in the other) and 6 find_binop calls. The analysis
of binary operators is done once, then reused.

The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.

The refactoring is composed of these parts (but patches from different parts
are interspersed):

1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API

Major refactoring, and no bugs fixed, so definitely not backporting.

Closes scylladb/scylladb#29114

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
  cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
  cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
  cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
  cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
  cql3: statement_restrictions: use predicate vector size for clustering prefix length
  cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
  cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
  cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
  cql3: statement_restrictions: add predicate-based index support checking
  cql3: statement_restrictions: use pre-built single-column maps for index support checks
  cql3: statement_restrictions: build clustering-prefix restrictions incrementally
  cql3: statement_restrictions: build partition-range restrictions incrementally
  cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
  cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
  cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
  cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
  cql3: statement_restrictions: track has-token state incrementally
  cql3: statement_restrictions: track partition-key-empty state incrementally
  cql3: statement_restrictions: track first multi-column predicate incrementally
  cql3: statement_restrictions: track last clustering column incrementally
  cql3: statement_restrictions: track clustering-has-slice incrementally
  cql3: statement_restrictions: track has-multi-column-clustering incrementally
  cql3: statement_restrictions: track clustering-empty state incrementally
  cql3: statement_restrictions: replace restr bridge variable with pred.filter
  cql3: statement_restrictions: convert single-column branch to use predicate properties
  cql3: statement_restrictions: convert multi-column branch to use predicate properties
  cql3: statement_restrictions: convert constructor loop to iterate over predicates
  cql3: statement_restrictions: annotate predicates with operator properties
  cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
  cql3: statement_restrictions: complete preparation early
  cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
  cql3: statement_restrictions: refine possible_lhs_values() function_call processing
  cql3: statement_restrictions: return nullptr for function solver if not token
  cql3: statement_restrictions: refine possible_lhs_values() subscript solving
  cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
  cql3: statement_restrictions: convert possible_lhs_values into a solver
  cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
  cql3: statement_restrictions: refactor IS NOT NULL processing
  cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
  cql3: statement_restrictions: fold add_is_not_restriction() into its caller
  cql3: statement_restrictions: fold add_restriction() into its caller
  cql3: statement_restrictions: remove possible_partition_token_values()
  cql3: statement_restrictions: remove possible_column_values
  cql3: statement_restrictions: pass schema to possible_column_values()
  cql3: statement_restrictions: remove fallback path in solve()
  cql3: statement_restrictions: reorder possible_lhs_column parameters
  cql3: statement_restrictions: prepare solver for multi-column restrictions
  cql3: statement_restrictions: add solver for token restriction on index
  cql3: statement_restrictions: pre-analyze column in value_for()
  cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
  cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
  cql3: statement_restrictions: adjust signature of range_from_raw_bounds
  cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
  cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
  cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
  cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
  cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
  cql3: statement_restrictions: wrap value_for_index_partition_key()
  cql3: statement_restrictions: hide value_for()
  cql3: statement_restrictions: push down clustering prefix wrapper one level
  cql3: statement_restrictions: wrap functions that return clustering ranges
  cql3: statement_restrictions: do not pass view schema back and forth
  cql3: statement_restrictions: pre-analyze token range restrictions
  cql3: statement_restrictions: pre-analyze partition key columns
  cql3: statement_restrictions: do not collect subscripted partition key columns
  cql3: statement_restrictions: split _partition_range_restrictions into three cases
  cql3: statement_restrictions: move value_list, value_set to header file
  cql3: statement_restrictions: wrap get_partition_key_ranges
  cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
  test: statement_restrictions: add index_selection regression test
2026-04-21 15:44:06 +03:00
Łukasz Paszkowski
d18eb9479f cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)`, the default initial tablets count
is set to 0 only when tablets are enabled and the replication strategy
is NetworkTopologyStrategy.

This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.

The patch sets the default initial tablets count to zero regardless of
the chosen replication strategy. Each replication strategy then
validates the options and raises a configuration exception when tablets
are not supported.

All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`

Fixes https://github.com/scylladb/scylladb/issues/25340

Closes scylladb/scylladb#25342
2026-04-20 17:57:38 +03:00
Avi Kivity
d584bd7358 cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
has_eq_restriction_on_column() walked expression trees at prepare time to
find binary_operators with op==EQ that mention a given column on the LHS.
Its only caller is ORDER BY validation in select_statement, which checks
that clustering columns without an explicit ordering have an EQ restriction.

Replace the 50-line expression-walking free function with a precomputed
unordered_set<const column_definition*> (_columns_with_eq) populated during
the main predicate loop in analyze_statement_restrictions.  For single-column
EQ predicates the column is taken from on_column; for multi-column EQ like
(ck1, ck2) = (1, 2), all columns in on_clustering_key_prefix are included.

The member function becomes a single set::contains() call.
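
A sketch of the precomputation, assuming a simplified `predicate_sketch` that identifies columns by name instead of by `const column_definition*`; the type and function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

enum class oper { eq, lt, gt, in };

// Hypothetical predicate: one column for a single-column restriction,
// several for a multi-column one such as (ck1, ck2) = (1, 2).
struct predicate_sketch {
    oper op;
    std::vector<std::string> columns;
};

// Populated once while iterating over the predicates at prepare time.
inline std::unordered_set<std::string> collect_columns_with_eq(
        const std::vector<predicate_sketch>& predicates) {
    std::unordered_set<std::string> result;
    for (const auto& p : predicates) {
        if (p.op == oper::eq) {
            result.insert(p.columns.begin(), p.columns.end());
        }
    }
    return result;
}

// The later lookup is a single set query (set::contains() in C++20).
inline bool has_eq_restriction_on_column(
        const std::unordered_set<std::string>& columns_with_eq,
        const std::string& column) {
    return columns_with_eq.find(column) != columns_with_eq.end();
}

// Usage example: only columns under an EQ predicate end up in the set.
inline void example() {
    std::vector<predicate_sketch> preds;
    preds.push_back({oper::eq, {"ck1", "ck2"}});
    preds.push_back({oper::gt, {"ck3"}});
    auto eq_cols = collect_columns_with_eq(preds);
    assert(has_eq_restriction_on_column(eq_cols, "ck2"));
    assert(!has_eq_restriction_on_column(eq_cols, "ck3"));
}
```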
2026-04-19 20:57:09 +03:00
Avi Kivity
b7f86eaabc cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
build_get_multi_column_clustering_bounds_fn() used expr::visit() to dispatch
each restriction through a 15-handler visitor struct.  Only the
binary_operator handler did real work; the conjunction handler just
recursed, and the remaining 13 handlers were dead-code on_internal_error
calls (the filter expression of each predicate is always a binary_operator).

Replace the visitor with a loop over predicates that does
as<binary_operator>(pred.filter) directly, building the same query-time
lambda inline.

Promote intersect_all() and process_in_values() from static methods of
the deleted struct to free functions in the anonymous namespace -- they
are still called from the query-time lambda.
2026-04-19 20:57:09 +03:00
Avi Kivity
ece9af229d cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
Replace find_binop(..., is_multi_column) with pred.is_multi_column in
build_get_clustering_bounds_fn() and add_clustering_restrictions_to_idx_ck_prefix().

Replace is_clustering_order(binop) with pred.order == comparison_order::clustering
and iterate predicates directly instead of extracting filter expressions.

Remove the now-dead is_multi_column() free function.
2026-04-19 20:57:09 +03:00
Avi Kivity
72da1207d7 cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
The previous commit made prepare_indexed_local() use the pre-built
predicate vectors instead of calling extract_single_column_restrictions_for_column().
That was the last production caller.

Remove the function definition (65 lines of expression-walking visitor)
and its declaration/doc-comment from the header.

Replace the unit test (expression_extract_column_restrictions) which
directly called the removed function with synthetic column_definitions,
with per_column_restriction_routing which exercises the same routing
logic through the public analyze_statement_restrictions() API.  The new
test verifies not just factor counts but the exact (column_name, oper_t)
pairs in each per-column entry, catching misrouted restrictions that a
count-only check would miss.
2026-04-19 20:57:09 +03:00
Avi Kivity
b093477cf7 cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
Replace the extract_single_column_restrictions_for_column(_where, ...) call
in prepare_indexed_local() with a direct lookup in the pre-built predicate
vectors.

The old code walked the entire WHERE expression tree to extract binary
operators mentioning the indexed column, wrapped them in a conjunction,
translated column definitions to the index schema, then called
to_predicate_on_column() which walked the expression *again* to convert
back to predicates.

The new code selects the appropriate predicate vector map (PK, CK, or
non-PK) based on the indexed column's kind, looks up the column's
predicates directly, applies replace_column_def to each, and folds them
with make_conjunction -- producing the same result without any expression
tree walks.

This removes the last production caller of
extract_single_column_restrictions_for_column (unit tests in
statement_restrictions_test.cc still exercise it).
2026-04-19 20:57:09 +03:00