scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 14:33:08 +00:00

Author	SHA1	Message	Date
Nadav Har'El	75a05fc2b3	Merge 'cql3: fix stack overflow and quadratic behavior' from Avi Kivity This series fixes two vulnerabilities: - unbounded recursion during expression evaluation with deeply nested expressions - quadratic computation with large WHERE clauses The fixes simply bound the depth of recursion and the length of the WHERE clause. The WHERE clause limits are configurable. The nesting limit is less likely to be exceeded, so it is not configurable. Limits inspired by Common Expression Language: https://github.com/google/cel-spec/blob/master/doc/langdef.md#syntax Implementations are required to support at least: - 24-32 repetitions of repeating rules - 12 repetitions of recursive rules CVE-2026-31948 CVE-2026-31947 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002 Fixes https://github.com/scylladb/scylladb/issues/14472 Closes scylladb/scylladb-ghsa-m4h7-g37h-mgxf#3 * github.com:scylladb/scylladb-ghsa-m4h7-g37h-mgxf: cql3: limit number of relations in WHERE clause cql3: add max_relations_in_where_clause to dialect test/cqlpy: add tests for WHERE clause relation count limit cql3: limit nesting depth of function calls and CASTs in CQL parser test/cqlpy: add tests for deeply nested function calls and CASTs	2026-06-01 22:31:56 +03:00
Avi Kivity	520b130b97	cql3: limit number of relations in WHERE clause A WHERE clause with many relations (e.g. hundreds of AND-ed conditions) can cause quadratic complexity. Check the relation count during parsing and reject queries exceeding the configurable max_relations_in_where_clause limit (default 100) with a SyntaxException. The changes to IDL don't cause problems during upgrade, because CQL forwarding is not in any released version, and because it is part of an experimental feature. CVE-2026-31947 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002	2026-06-01 14:01:27 +03:00
Avi Kivity	fdcc44c425	cql3: add max_relations_in_where_clause to dialect Add a configurable max_relations_in_where_clause parameter (default 100) to the CQL dialect, plumbed through db::config, transport server, and test environment. This will be used by the CQL parser to reject WHERE clauses with too many relations that cause quadratic complexity.	2026-06-01 14:01:27 +03:00
Botond Dénes	bb81dbf65e	Merge 'guardrails: Add replica-side large data guardrails' from Taras Veretilnyk Adds write-path guardrails that reject or warn on mutations targeting partitions, rows, or collections that already exceed configured size thresholds, based on SSTable `large_data_record` metadata. ScyllaDB already detects and records large partitions/rows/cells in `system.large_data_records` after compaction, but takes no preventive action on the write path. Once a partition grows past operational limits it causes latency spikes, OOM, and repair failures. These guardrails let operators set hard and soft thresholds so that writes to already-oversized data are rejected (hard) or logged as warnings (soft) before they make the problem worse. - Intrusive index over SSTable metadata: A per-table `large_data_record_index` maintains three `boost::intrusive::multiset`s (partitions, rows, cells) using `auto_unlink` hooks directly on `large_data_record`. SSTable destruction automatically removes records from the index — no explicit deregistration needed. - Virtual dispatch for zero-cost disabled path: `large_data_guardrail_base` → `noop_large_data_guardrail` / `large_data_guardrail`. Tables without guardrails enabled pay only a virtual call to a no-op. No index is built or maintained for disabled tables. - Schema storage: The per-table flag is stored as a scylla_tables column, following the tablets pattern: only write a live cell when enabled, omit entirely when disabled. The CQL feature gate prevents enabling until all nodes are upgraded. - Write-path integration: The guardrail check runs in `do_apply` after the frozen mutation is deserialized but before it is applied to the memtable. Hint replay and Paxos learn skip the check via `skip_large_data_guardrails`. Uses existing `large__warn_threshold` config options as soft limits and new `large__fail_threshold` options as hard limits. 
Checked dimensions: - Partition size (bytes) - Partition row count - Row size (bytes) - Collection element count Backport is not required Fixes https://scylladb.atlassian.net/browse/SCYLLADB-180 Closes scylladb/scylladb#29733 * github.com:scylladb/scylladb: test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests test/cqlpy: add large collection guardrail tests test/cqlpy: add large row guardrail tests test/cqlpy: add large partition guardrail tests test/boost: add large_data_guardrail unit tests test/cluster: add large data guardrails rolling upgrade test replica: wire large_data_guardrail into the write path schema: add per-table large_data_guardrails_enabled flag db: implement large_data_guardrail db: implement large_data_record_index sstables: add intrusive index hook to large_data_record db: add large_collection_elements_fail_threshold config option db: add large_row_fail_threshold_mb config option db: add rows_count_fail_threshold config option db: add large_partition_fail_threshold_mb config option replica: introduce large_data_exception	2026-06-01 13:26:00 +03:00
Nadav Har'El	33dce2b7fc	Merge 'cql3: statement_restrictions: continue exploitation of predicate work' from Avi Kivity In `6165124fcc`, we changed analysis of expressions in the WHERE clause to use predicates, an annotated form of an expression that constrains a column when the expression is set to true. Here, we exploit this work to simplify the analysis further, reusing already computed attributes rather than re-analyzing the expression. Not backporting, this is a refactor with no functional change and no bugs fixed. Closes scylladb/scylladb#30049 * github.com:scylladb/scylladb: cql3: statement_restrictions: simplify find_idx to return only the index cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key cql3: statement_restrictions: replace find_clustering_order with predicate order field cql3: statement_restrictions: replace has_partition_token with variant check cql3: statement_restrictions: replace has_slice with predicate is_slice check cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column cql3: statement_restrictions: replace find_needs_filtering with predicate op check cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool cql3: statement_restrictions: replace find_binop column extraction with predicate on field cql3: statement_restrictions: set op on all binary-operator-derived predicates	2026-05-31 23:22:43 +03:00
Benny Halevy	d4d43213f6	cql3/statements/describe_statement: use chunked_vector to prevent oversized allocations When running the 5000-table scenario using tablets, the following Scylla warnings appeared: ``` 2026-02-23T23:18:31.903 schema-scale-tablets-5000t-2026-1-db-node-77930459-4 !WARNING | scylla[5208] [shard 1:sl:d] seastar_memory - oversized allocation: 655360 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x320cf9f 0x320cba0 0x1826a28 0x2fb8f97 0x180340e 0x447855e 0x4461c5a 0x161c3c6 0x161c4b3 0x161e9b7 0x551f43c 0x54df6ca /opt/scylladb/libreloc/libc.so.6+0x72463 /opt/scylladb/libreloc/libc.so.6+0xf55ab seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:865 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:928 cql3::statements::(anonymous namespace)::tables(data_dictionary::database const&, seastar::lw_shared_ptr<data_dictionary::keyspace_metadata> const&, std::optional<bool>) [clone .resume] at ././seastar/src/core/memory.cc:1727 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<cql3::description, std::allocator<cql3::description> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 ``` This patch replaces the use of `std::vector<description>` with `utils::chunked_vector` to prevent the large allocation. 
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-852 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#30146	2026-05-31 14:50:18 +03:00
Avi Kivity	503add224d	cql3: statement_restrictions: simplify find_idx to return only the index The expression returned as the second element of find_idx()'s pair was stored in view_indexed_table_select_statement::_used_index_restrictions but never read — dead code. Simplify find_idx() to return just the optional<index>, and remove the dead member and constructor parameter from view_indexed_table_select_statement. The now unused _idx_restrictions is also removed.	2026-05-29 17:18:21 +03:00
Avi Kivity	23d6f458ec	cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans Replace has_only_eq_binops() (which uses find_in_expression to search the expression tree for non-EQ binary operators) with precomputed _pk_is_all_eq and _ck_is_all_eq booleans tracked incrementally during predicate construction. Each predicate's equality field is checked as it is processed, covering single-column PK/CK predicates, multi-column CK predicates, and token predicates. This removes the last find_in_expression call in statement_restrictions.cc, and eliminates has_only_eq_binops entirely.	2026-05-29 17:13:40 +03:00
Avi Kivity	9e70771600	cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key Instead of rebuilding predicates from the expression tree at build_value_for_index_partition_key_fn() time via to_predicate_on_column, capture the indexed column's predicates directly in do_find_idx() when the index is chosen. Store them as _idx_column_predicates and use them to build the value function without any redundant expression analysis. This eliminates build_value_for_fn and its call to to_predicate_on_column from this code path.	2026-05-29 17:13:37 +03:00
Avi Kivity	a402fe1a65	cql3: statement_restrictions: replace find_clustering_order with predicate order field In build_range_from_raw_bounds_fn(), replace find_clustering_order() (which uses find_binop to search for a binary_operator with clustering comparison order) with a direct check of the predicate's order field. Since each multi-column predicate's filter is a single binary_operator, extract it directly with as<binary_operator>() instead of searching.	2026-05-29 16:50:02 +03:00
Avi Kivity	b3c1ee230b	cql3: statement_restrictions: replace has_partition_token with variant check Replace has_token_restrictions()'s call to has_partition_token() (which uses find_binop to search the expression tree for token function calls) with a direct check on the _partition_range_restrictions variant, which already records whether token restrictions exist.	2026-05-29 16:50:02 +03:00
Avi Kivity	e10c124cd3	cql3: statement_restrictions: replace has_slice with predicate is_slice check In the clustering prefix construction loop, replace the has_slice() call (which uses find_binop to search the merged predicate's expression tree for slice operators) with a direct check on the individual predicate vector's is_slice field.	2026-05-29 16:50:02 +03:00
Avi Kivity	4c282f588a	cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column In calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index(), the code extracted multi-column boolean factors from _clustering_columns_restrictions. Since multi-column and single-column CK restrictions cannot be mixed (the constructor enforces this), when _has_multi_column is true, ALL factors are multi-column. Simplify to just adding _clustering_columns_restrictions directly when _has_multi_column is set. This removes the last caller of contains_multi_column_restriction(), allowing the function (and its find_binop call) to be removed.	2026-05-29 16:50:02 +03:00
Avi Kivity	dca2cc512e	cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering These helper functions were wrappers around find_binop that are no longer called, since their call sites have been replaced by predicate-based checks.	2026-05-29 16:50:02 +03:00
Avi Kivity	eb98aea466	cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool Replace the has_slice_or_needs_filtering() call on _partition_key_restrictions (which uses find_binop to walk the expression tree) with a precomputed _pk_has_slice_or_needs_filtering boolean tracked incrementally during predicate construction in the partition key branch.	2026-05-29 16:50:02 +03:00
Avi Kivity	6e27c3a185	cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column In clustering_key_restrictions_need_filtering(), replace the contains_multi_column_restriction() call (which uses find_binop to search for a tuple_constructor LHS in the expression tree) with the precomputed _has_multi_column boolean that is already tracked incrementally during predicate construction.	2026-05-29 16:50:01 +03:00
Avi Kivity	ae7eb860a5	cql3: statement_restrictions: replace find_needs_filtering with predicate op check In the clustering prefix construction loop, replace the find_needs_filtering() call (which walks the merged predicate's expression tree looking for needs-filtering binary operators) with a check on the individual predicate vector. This uses the per-predicate op field directly instead of searching the expression tree.	2026-05-29 16:50:01 +03:00
Avi Kivity	556262a165	cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool Replace the two find_binop(_clustering_columns_restrictions, is_on_collection) calls with a precomputed _ck_is_on_collection boolean that is tracked incrementally during predicate construction. This avoids walking the expression tree at each call site. The is_on_collection check detects CONTAINS/CONTAINS_KEY operators, which indicate collection restrictions on a clustering key column.	2026-05-29 16:50:01 +03:00
Avi Kivity	569c85032e	cql3: statement_restrictions: replace find_binop column extraction with predicate on field In add_clustering_restrictions_to_idx_ck_prefix(), find_binop was used to locate any binary_operator in the predicate's filter just to extract the column from its LHS. Since the predicate already stores this information in its 'on' field (as on_column for single-column predicates), use it directly instead of searching the expression tree.	2026-05-29 16:50:01 +03:00
Avi Kivity	240d9be5e2	cql3: statement_restrictions: set op on all binary-operator-derived predicates The to_predicates() function had fallthrough paths for operators like LIKE and NOT_IN that created predicates without setting the op field. This meant predicate-based checks like 'p.op && needs_filtering(*p.op)' would miss these operators. Fix by inlining the predicate construction at the fallthrough points (instead of using cannot_solve_on_column) and setting .op = oper.op. This ensures all predicates derived from binary operators carry their operator type, enabling reliable predicate-based analysis. The cannot_solve_on_column helper is now unused and removed.	2026-05-29 16:50:01 +03:00
Taras Veretilnyk	5a0974e781	schema: add per-table large_data_guardrails_enabled flag Add a per-table large_data_guardrails_enabled flag controlled via the CQL table property WITH large_data_guardrails_enabled = true|false. Store the flag as a boolean column in system_schema_ext.scylla_tables. Only write a live cell when enabled; when disabled (the default), omit the cell entirely so that old nodes that don't know this column can still read the SSTable during rolling upgrade or rollback. When the property transitions from true to false via ALTER TABLE, a tombstone is written in make_update_table_mutations to override the previous live cell; this is safe because the CQL feature gate ensures all nodes are upgraded before the property can be set to true. Gate the CQL property behind the LARGE_DATA_GUARDRAILS cluster feature: attempting to set large_data_guardrails_enabled = true before all nodes advertise the feature raises a ConfigurationException.	2026-05-29 12:18:33 +02:00
Yaniv Michael Kaul	f90b066405	cql3: lazily allocate _idx_opt behind unique_ptr Motivation: The secondary_index::index object stored in statement_restrictions is approximately 128 bytes (containing index_metadata with its sstring name, UUID id, and unordered_map options, plus a target_column sstring). This field is only populated for queries that use secondary indexing, yet every prepared statement's restrictions object pays the full inline cost. Replace std::optional<secondary_index::index> with std::unique_ptr<secondary_index::index>. This reduces the inline size from 136 bytes to 8 bytes, saving 128 bytes per non-index-using prepared statement cached in the prepared statement cache. The semantics are preserved: null unique_ptr is equivalent to std::nullopt, and the dereference patterns (-> and *) work identically. The find_idx() method that returns a copy constructs an optional from the dereferenced pointer when non-null. Tests: - statement_restrictions_test builds and passes - Full release build compiles cleanly Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: Yes Backport: no, improvement Closes scylladb/scylladb#30046	2026-05-28 21:35:25 +03:00
Nadav Har'El	7a387a499f	Merge 'cql3: extract vector search select statement into cql3/statements/external_search/' from Szymon Malewski Extract vector_indexed_table_select_statement and its filter logic out of the monolithic select_statement.cc and vector_search/ module into a dedicated directory cql3/statements/external_search/. This improves modularity and eliminates a circular dependency between cql3 and vector_search: the filter code depends heavily on cql3 types (expressions, query_options, statement_restrictions) and belongs in the cql3 layer. Follow-up to VECTOR-250 which originally addressed the same dependency but has since regressed. This is also a preparatory refactoring for full-text search select statements, which can share some implementation with the vector search. Pure refactoring, no semantic changes - no need for backporting. Closes scylladb/scylladb#30100 * github.com:scylladb/scylladb: vector_index: move filter into cql3/statements/external_search cql3: extract vector_indexed_table_select_statement into own compilation unit vector_index: split query_base_table to return raw coordinator_result	2026-05-28 11:26:49 +03:00
Szymon Malewski	ed1006928f	vector_index: move filter into cql3/statements/external_search Move prepared_filter, prepared_restriction, prepared_rhs types and prepare_filter() from vector_search/filter.{hh,cc} into new files cql3/statements/external_search/filter.{hh,cc} under namespace cql3::statements::external_search. This eliminates a circular dependency between the cql3 and vector_search modules: the filter code depends heavily on cql3 types (expressions, query_options, statement_restrictions) and belongs in the cql3 layer. This is a follow-up to VECTOR-250 which originally addressed the same circular dependency but has since regressed.	2026-05-27 21:43:56 +02:00
Szymon Malewski	5e94abe3bc	cql3: extract vector_indexed_table_select_statement into own compilation unit Move vector_indexed_table_select_statement and its associated helpers (ann_ordering_info, get_ann_ordering_info, add_similarity_function_to_selectors, get_similarity_ordering_comparator) from select_statement.hh/.cc into new files cql3/statements/external_search/vector_indexed_table_select_statement.hh/.cc.	2026-05-27 21:43:52 +02:00
Botond Dénes	555cfbcd38	Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. - schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings. No backport: the Seastar commit that deprecated these functions hasn't made (and won't make) its way into any release branches (and the warnings are cosmetic anyway) Closes scylladb/scylladb#29990 * github.com:scylladb/scylladb: treewide: replace deprecated smp::count and smp::all_cpus() with new APIs scylla-gdb: read shard count from smp::_this_smp instead of smp::count schema_builder: make shard_count an explicit constructor parameter	2026-05-27 09:42:06 +03:00
Szymon Malewski	aa17c7739e	vector_index: split query_base_table to return raw coordinator_result The inner query_base_table overloads previously called process_results() themselves, duplicating row_limit setup and making it impossible to thread per-execution context (e.g. a similarity provider) into result processing. Lift process_results() to the top-level overload and change the two inner overloads to return coordinator_result<foreign_ptr<query::result>> directly. This cleanly separates query dispatch from result processing, and opens the door to passing execution-time context at the single process_results() call site. No functional change.	2026-05-26 21:37:13 +02:00
Avi Kivity	8010e408a2	treewide: replace deprecated smp::count and smp::all_cpus() with new APIs Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched). Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads. Notable cases: - dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable. - service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads. Not changed: - scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context). - Python test files: only reference smp::count in comments/strings.	2026-05-26 17:35:20 +03:00
Avi Kivity	f165b396fd	schema_builder: make shard_count an explicit constructor parameter A recent Seastar update deprecated smp::count and introduced this_smp_shard_count() as a replacement. One difference is that this_smp_shard_count() wants to run on a reactor thread. This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE) that nevertheless use a schema, as the schema_builder constructor references smp::count. If we replace it with this_smp_shard_count() then it will crash when running without a reactor. To fix, remove the implicit this_smp_shard_count() call from raw_schema's constructor and require callers to pass shard_count explicitly to schema_builder. This allows tests that don't run on a reactor thread to construct schemas without crashing. Production code and reactor-based tests pass this_smp_shard_count(). Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test, wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test) pass a fixed shard count of 1. Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE) but also contains one plain BOOST_AUTO_TEST_CASE (test_empty_key_view_comparison) that constructs a schema_builder without a reactor context. This test also receives a fixed shard count of 1.	2026-05-26 11:55:56 +03:00
Botond Dénes	6c3f104b67	cql3: store raw query string in utils::chunked_string Read the query as a fragmented string from the input stream in transport/server.cc, propagate it as such to query_processor::prepare() and also store it as such in cql3::cql_statement::raw_cql_statement. Unfortunately, the query still has to be linearized for parsing, as ANTLR, although it allows for a custom InputStream implementation, plays pointer arithmetic games with the pointers obtained from it, so fragmented input cannot be used. To amortize the cost of this linearization, the query string is linearized through utils::reusable_buffer. The parser can be invoked recursively; nested invocations linearize directly. Still, this patch limits the places where the query is linearized to the following: * Parsing * Audit * Logs and error messages So the normal query paths for queries that actually can get arbitrarily large (UPDATE and INSERT) should only linearize the query temporarily for parsing.	2026-05-26 09:08:06 +03:00
Botond Dénes	4af3359744	cql3/expr: use utils::chunked_string for untyped_constant::raw_text This value can be a string or bytes literal, which can get very large in rare cases. Use chunked storage to avoid large allocations.	2026-05-26 09:08:06 +03:00
Botond Dénes	597d4252dc	types: abstract_type::from_string() switch to fragmented buffers (interface) Change input: std::string_view -> utils::chunked_string_view. Change return value: bytes -> managed_bytes. This patch only changes the interface, with some to_bytes() sprinkled in the internals to deal with recursive calls. Internals will be updated in the next patch, to keep the churn of updating callers separate from the actually important changes.	2026-05-26 09:08:06 +03:00
Nadav Har'El	96dd3121e7	Merge 'cql: rewrite CassIO SAI metadata index to regular secondary index' from Szymon Wasik CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index: ```sql CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table> ON <keyspace>.<table> (ENTRIES(metadata_s)) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'; ``` ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with: ``` StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns ``` This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`). CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern. Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries: - Detection: SAI class name + single `ENTRIES` target on a non-frozen `map` column - Rewrite: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries) - Warning: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication. 
After this change, the CassIO schema setup succeeds on ScyllaDB: - `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index - The index is functional and can accelerate metadata filtering queries - A CQL warning makes the rewrite transparent to operators - SAI on non-vector, non-map-entries columns is still rejected as before - Vector SAI indexes continue to be rewritten to `vector_index` as before - `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name) - `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias - `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped). Fixes: SCYLLADB-2113 Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added. Closes scylladb/scylladb#29981 * github.com:scylladb/scylladb: cql: rewrite CassIO SAI metadata index to regular secondary index db/config: add enable_cassio_compatibility flag	2026-05-26 00:19:03 +03:00
Szymon Wasik	5ee339b11d	cql: rewrite CassIO SAI metadata index to regular secondary index When CassIO creates a SAI ENTRIES index on a map column, ScyllaDB now rewrites it to a regular secondary index and emits a CQL warning. This allows LangChain/CassIO applications to work without DDL errors. The rewrite is gated behind the enable_cassio_compatibility flag (disabled by default). Refs: SCYLLADB-2113	2026-05-25 15:11:43 +02:00
Avi Kivity	892f22f49c	Merge 'cql: atomic add/subtract operations with LWT' from Nadav Har'El ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results. This patch allows such atomic add/subtract operations in LWT statements. For example UPDATE ... SET a = a - 7 IF a > 0 or UPDATE ... SET a = a + 1 IF a != NULL The row updated in the operation, and the updated column (`a`) should be initialized before the update. The example `SET a = a + 1 IF a != NULL` will fail the condition if `a` is not set. A different request `SET a = a + 1 IF EXISTS` will just leave `a` unset if it's unset (NULL + 1 is NULL, following SQL's null propagation rules). These add/subtract operations are allowed on any numeric (integer or floating point) column. The ability of LWT to fetch the old values of a column and use them to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT. This series does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax already existed for counters, and we just allow the same syntax for non-counters. However, the series does add a bit of machinery that will allow us to easily support more general expressions in the future. In particular, this series implements the addition, subtraction, and unary-minus operators for expressions, and adds the machinery needed to run any expression in "SET a = expr()", using existing row values fetched by LWT. This is a new Scylla-only feature that does not exist in Cassandra. 
Fixes #10568 Refs #22918 ("Support arithmetic operators"), SCYLLADB-1576 ("Decimal arithmetic operations OOM") This is a new feature, so normally would not be backported. Closes scylladb/scylladb#29939 * github.com:scylladb/scylladb: cql: atomic add/subtract operations with LWT cql3: let constants::setter evaluate expressions using prefetched row data cql3/expr: add NEG unary operator for numeric negation cql3/expr: add SUB binary operator for numeric subtraction cql3/expr: add ADD binary operator for numeric addition types: add is_arithmetic() method for types	2026-05-25 14:27:33 +03:00
Piotr Dulikowski	3a5dd2e5be	Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros Strongly consistent reads currently call read_barrier() on whichever replica happens to process the request. When a follower runs read_barrier(), it sends an RPC to the leader to get the current read index, then waits for its local apply index to catch up. If the follower is behind, this wait can be significant. By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally. Note that read_barrier() is still required on the leader to confirm it is still the leader and guarantee linearizability. A future optimization would be to implement leases in the raft library, which could eliminate read_barrier() on the leader entirely. The CL-to-behavior mapping is isolated in a single parse_consistency_level() function: - CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader - CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results) - All other CLs -> invalid request Read forwarding reuses the same CQL-layer bounce_to_node() mechanism that write forwarding already uses. The transport layer's existing requests_forwarded_* metrics automatically count forwarded reads. Coordinator-level metrics (linearizable_reads, non_linearizable_reads, writes) are added for visibility into the strong consistency workload. Fixes: SCYLLADB-1157 Closes scylladb/scylladb#29575 * github.com:scylladb/scylladb: strong_consistency: test read forwarding to leader strong_consistency: skip read_barrier() for non-linearizable reads strong_consistency: split coordinator-level read latency metrics strong_consistency: forward linearizable reads to raft leader strong_consistency: classify reads by consistency level strong_consistency: add begin_read() to raft_server	2026-05-25 10:55:00 +02:00
Nadav Har'El	f8aaeb5e87	cql: atomic add/subtract operations with LWT ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results. This patch allows such atomic add/subtract operations in LWT statements. Some examples: UPDATE ... SET a = a - 1 IF a > 0 UPDATE ... SET a = a + 1 IF EXISTS UPDATE ... SET a = a + 1 IF a != NULL The row updated in the operation, and the updated column (a) should be initialized before the update - arithmetic operations on missing column values silently leave the column null (no error is generated). These add/subtract operations are allowed on any numeric column - integer or floating point of any size. The ability of LWT to fetch the old value of a column and use it to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT. This patch does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax that already existed for counters is now allowed for non-counters. This is a new Scylla-only feature that does not exist in Cassandra. Fixes #10568 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-25 10:09:11 +03:00
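The conditional-update semantics described in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions, not ScyllaDB code: a row is modeled as a Python dict (or `None` when it does not exist), and `lwt_add`, `if_exists`, `if_not_null` and `if_greater_than` are hypothetical helper names. The real feature performs the read and the write atomically through LWT, which this single-threaded model does not attempt to show.

```python
def if_exists(row):
    # IF EXISTS: the row only has to be present.
    return True

def if_not_null(column):
    # IF <column> != NULL: the column must hold a value.
    return lambda row: row.get(column) is not None

def if_greater_than(column, bound):
    # IF <column> > bound: a missing (null) column never satisfies it.
    return lambda row: row.get(column) is not None and row[column] > bound

def lwt_add(row, column, delta, condition):
    """Model `UPDATE ... SET column = column + delta IF <condition>`.

    Returns (applied, new_row). Null propagation: NULL + delta is NULL.
    """
    if row is None or not condition(row):
        return False, row
    old = row.get(column)
    new = None if old is None else old + delta
    return True, {**row, column: new}
```

With this model, `lwt_add({"pk": 1}, "a", 1, if_not_null("a"))` is not applied because `a` is unset, while the same update with `if_exists` is applied but leaves `a` null.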
Nadav Har'El	3c6931c1ed	cql3: let constants::setter evaluate expressions using prefetched row data Previously, constants::setter evaluated its expression using only the query options, which means expressions referencing row columns (column_value nodes) would crash or return incorrect results. Add evaluate_on_prefetched_row() to update_parameters: it evaluates an expression in the context of the prefetched row for a given (pkey, ckey), falling back to options-only evaluate() when no selection is available (non-LWT context) or no column values are needed, and treating absent columns needed by the expression as null. Extend constants::setter to use this method: - setter::execute() now calls evaluate_on_prefetched_row() or evaluate() as needed. - setter::requires_read() returns true when the expression contains a column_value node, triggering a prefetch read. - setter::requires_lwt() mirrors requires_read(), enforcing that column-referencing arithmetic is only allowed inside a conditional (IF) statement. We'll use this new feature to implement "SET r = r + 1" and similar expressions in the next patch.	2026-05-25 10:09:11 +03:00
Nadav Har'El	b026aea6f7	cql3/expr: add NEG unary operator for numeric negation This patch adds a new expression type, unary_operator, analogous to the existing binary_operator but taking just one operand instead of two. This patch also implements the first and only unary operator type, unary_oper_t::NEG, implementing negation (unary minus) for all numeric types. For fixed-width integer types, overflow or underflow results in an error. If the operand is NULL, the result is a NULL as well. The new operator is not yet used by the CQL syntax - our parser doesn't parse arithmetic expressions yet. We also do not plan to use it in the following patch which uses the separate SUB (subtraction) operation, not the new NEG. But since I already implemented a unary minus operator, and we'll surely need it in the future for general arithmetic operations, I thought I might as well include this patch as well. Refs #22918 ("Support arithmetic operators")	2026-05-25 10:08:11 +03:00
Nadav Har'El	f27d1f08fc	cql3/expr: add SUB binary operator for numeric subtraction In this patch we add to our expressions oper_t::SUB, for subtraction, analogous to the ADD from the previous patch. The only reason why we need a separate SUB operation and can't just combine ADD with a unary minus (NEG) operator is the minimum value of a fixed-width integer type. For example, 8-bit integers have the range -128...127. A subtraction like -1 - (-128) is valid (its value is 127) but the negation of (-128) would be invalid (128). One of the tests we add in this patch validates this fact. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-25 10:06:28 +03:00
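The 8-bit argument in the SUB commit above can be checked with a small model. This is a sketch under assumptions, not the ScyllaDB implementation: `add`, `sub`, `neg`, `sub_via_neg` and `Int8Overflow` are hypothetical names, Python's unbounded integers stand in for a checked `tinyint`, and `None` stands in for NULL.

```python
INT8_MIN, INT8_MAX = -128, 127

class Int8Overflow(ArithmeticError):
    """Raised when a result does not fit in a signed 8-bit integer."""

def _checked(value):
    # Reject any result outside the tinyint range instead of wrapping.
    if not INT8_MIN <= value <= INT8_MAX:
        raise Int8Overflow(f"{value} is out of range for an 8-bit integer")
    return value

def add(a, b):
    # NULL propagation: if either operand is NULL, so is the result.
    if a is None or b is None:
        return None
    return _checked(a + b)

def sub(a, b):
    if a is None or b is None:
        return None
    return _checked(a - b)

def neg(a):
    if a is None:
        return None
    return _checked(-a)

def sub_via_neg(a, b):
    # The rewrite a - b == a + (-b), which breaks when b is INT8_MIN.
    return add(a, neg(b))
```

Here `sub(-1, -128)` yields 127, whereas `sub_via_neg(-1, -128)` raises because the intermediate value 128 is not representable, which is why a dedicated subtraction operator is needed.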
Nadav Har'El	083adf84ab	cql3/expr: add ADD binary operator for numeric addition Extend oper_t with a new ADD operator, to represent addition between two numeric expressions. Supports all numeric types - tinyint, smallint, int, bigint, float, double, varint, and decimal. For fixed-width integer types, overflow or underflow results in an error. If either operand is NULL, the result is also NULL. The new operator is not yet used by the CQL syntax - our parser doesn't parse arithmetic expressions yet. We plan to start using this new operator in a following patch which implements counter syntax ("SET r = r + 1") for LWT, but in the future we can use it for more general cases. At the moment, ADD requires that both operands have the same type. This is all we need for the first use case, and this limitation can be relaxed later. Interestingly, ADD is our first binary operator implementation that does not return a boolean. Until now all our binary operators have been comparison operators, and all returned boolean. In contrast, ADD's return type is the type of its operands. This implementation is susceptible to the pre-existing bug SCYLLADB-1576, where adding 1e1000000 and 1 in "decimal" or "varint" types will happily allocate a million-digit number and run out of memory. A reproducing test is included, and this issue will be solved in one place for all operations that have additions (including aggregations and arithmetic expressions) in a followup pull-request. Refs #22918 ("Support arithmetic operators")	2026-05-25 10:05:09 +03:00
Wojciech Mitros	c0ea98f922	strong_consistency: classify reads by consistency level Introduce a read_type enum (linearizable vs non_linearizable) and transform the existing "validate" function into a "parse" method - instead of checking if the consistency level is one of the accepted ones, we now also return the corresponding read type for strong consistency. The "parse" function maps CQL consistency levels to the following read types: - CL=(LOCAL_)QUORUM -> linearizable (this is the default CL) - CL=(LOCAL_)ONE -> non_linearizable - all others -> throw The classification is performed in the CQL layer (select_statement) to keep the coordinator free of CL concepts.	2026-05-23 11:35:37 +02:00
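The consistency-level mapping listed in the commit above can be sketched as a single function. The names `ReadType` and `parse_consistency_level` mirror the terms used in the commit messages, but the signature, the use of strings for consistency levels, and the exception type are assumptions made for this illustration only.

```python
from enum import Enum

class ReadType(Enum):
    LINEARIZABLE = "linearizable"          # forwarded to the raft leader
    NON_LINEARIZABLE = "non_linearizable"  # served locally, may be stale

def parse_consistency_level(cl):
    """Map a CQL consistency level name to a strong-consistency read type."""
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return ReadType.LINEARIZABLE
    if cl in ("ONE", "LOCAL_ONE"):
        return ReadType.NON_LINEARIZABLE
    # Every other consistency level is an invalid request.
    raise ValueError(f"consistency level {cl} is not supported for strongly consistent reads")
```

Keeping the mapping in one place means callers only deal with the two read types and never with the individual consistency levels.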
Avi Kivity	e35c388f65	cql3: limit nesting depth of function calls and CASTs in CQL parser Deeply nested expressions like f(f(f(...))) can overflow the evaluator stack. Add depth tracking in the recursive entry points of the CQL grammar (unaliasedSelector, term, relation), rejecting expressions that exceed the max_expression_nesting limit (12) with a SyntaxException. CVE-2026-31948 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003	2026-05-21 23:24:03 +03:00
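The depth check described in the commit above can be approximated outside the grammar. The sketch below is an assumption-laden stand-in: the actual change tracks depth inside the recursive grammar rules, whereas this hypothetical `check_nesting` helper simply counts parentheses, ignores string literals, and assumes that exactly 12 levels are still accepted.

```python
MAX_EXPRESSION_NESTING = 12  # limit quoted in the commit message

class SyntaxException(Exception):
    """Stand-in for the parser's syntax error."""

def check_nesting(expr, limit=MAX_EXPRESSION_NESTING):
    """Return the maximum parenthesis depth of expr, rejecting depth > limit."""
    depth = 0
    max_depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            if depth > limit:
                raise SyntaxException(f"expression nesting exceeds {limit}")
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth
```

For example, `check_nesting("f(" * 12 + "x" + ")" * 12)` is accepted, while one more level of nesting raises `SyntaxException` before any evaluation takes place.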
Avi Kivity	305346a3ec	Merge 'Don't materialize collections into intermediate representations' from Botond Dénes Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again. This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections. This PR aims to solve this problem once and for all. The plan is as follows: * Promote direct use of the serialized collection format: - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`. - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`). * Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure. * Drop the old infrastructure, to avoid accidental regressions. Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion. 
To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric                   | Before  | After   | Change |
|--------------------------|--------:|--------:|--------|
| Throughput (median tps)  | 315,760 | 332,021 | +5.1%  |
| Instructions/op (median) | 53,776  | 48,681  | -9.5%  |
| CPU cycles/op (median)   | 17,365  | 16,471  | -5.1%  |
| Allocations/op           | 85.1    | 82.1    | -3.5%  |

Significant improvement. Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric                   | Before   | After    | Change |
|--------------------------|---------:|---------:|--------|
| Throughput (median tps)  | 150,823  | 149,678  | -0.8%  |
| Instructions/op (median) | 108,388  | 103,858  | -4.2%  |
| CPU cycles/op (median)   | 34,860   | 35,371   | +1.5%  |
| Allocations/op           | ~105–108 | ~102–103 | -3.0%  |

Mixed, mostly neutral. Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

Command:
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric                   | Before  | After   | Change |
|--------------------------|--------:|--------:|--------|
| Throughput (median tps)  | 55,777  | 56,051  | +0.5%  |
| Instructions/op (median) | 246,215 | 246,610 | +0.2%  |
| CPU cycles/op (median)   | 77,641  | 77,020  | -0.8%  |
| Allocations/op           | 340.4   | 335.4   | -1.5%  |

Essentially neutral. All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a clear, substantial positive effect on reads (~5% throughput gain, ~9.5% fewer instructions per op). The write and alternator paths are unaffected in practice; changes there are within measurement noise. No regressions are apparent. This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602 Improvement, no backport. 
Closes scylladb/scylladb#29127 * github.com:scylladb/scylladb: mutation/collection_mutation: make collection_mutation::_data private mutation_collection: drop collection_mutation_description and friends test: move away from collection_mutation_description tree: move away from collection_mutation_description test: move away from collection_mutation_view::with_deserialized() tree: move away from collection_mutation_view::with_deserialized() types: fix indendation, left broken by previous commit types: move away from collection_mutation_view::with_deserialized() types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT() schema: column_computation: move away from collection_mutation_view::with_deserialized() mutation: move away from collection_mutation_view::with_deserialized() alternator: move away from collection_mutation_view::with_deserialized() cdc: move away from collection_mutation_view::with_deserialized() mutation/collection_mutation: printer: don't deserialize collections mutation/collection_mutation: difference(): don't deserialize collections mutation/collection_mutation: merge(): don't deserialize collections mutation/collection_mutation: extract compact_and_expire() to free function mutation/collection_mutation: refactor empty(), is_any_live() and last_update() compaction_garbage_collector: pass collection_mutation to collect() test/boost/mutation_test: add tests for collection_mutation_{view,writer} mutation/collaction_mutation: collection_mutation_view: add methods to inspect content mutation/collection_mutation: add collection_mutation_writer mutation/collection_mutation: collection_mutation(): generate valid collection mutation/collection_mutation: collection_mutation(): remove unused abstract_type param mutation/atomic_cell: drop unused type param from from_bytes()	2026-05-21 17:10:40 +03:00
Wojciech Mitros	13c043903d	strong_consistency: cache leader location for non-replica nodes When a non-replica node handles a strongly consistent write, it must forward the request to a replica. If the closest replica is not the leader, the request gets redirected again, causing an extra roundtrip. Add a leader location cache in groups_manager, keyed by raft group_id. After a write request is forwarded, the CQL transport layer records the final node as the leader in the cache. Subsequent write requests from the same node for the same group are forwarded directly to the cached leader, eliminating the extra roundtrip. The cache is only used for writes. Reads can be served by any replica, so they skip the cache and use proximity-based routing instead. Cache entries are validated at use time: if the cached leader is no longer a replica (e.g. after tablet migration), the entry is evicted and the normal closest-replica path is taken. This prevents a scenario where two nodes keep redirecting to each other because both think that the other is the leader but actually both are non-replicas - such a loop is broken as soon as the tablet maps are updated. On token_metadata updates, entries for groups that no longer exist (e.g. table dropped, tablet merged) are evicted. Entries for groups that still exist are kept; use-time validation handles staleness. An on_node_resolved callback is propagated through the redirect/bounce path so the transport layer can update the cache generically without coupling to the strong-consistency coordinator. The coordinator creates the callback only for writes (capturing the groups_manager and group_id) and attaches it to the bounce message; the transport layer invokes it once the final node is known, keeping the forwarding infrastructure subsystem-agnostic. We also add a test which verifies that after the initial redirect, following requests to the same node avoid the extra redirect and forward directly to the leader. 
Fixes: SCYLLADB-1064 Closes scylladb/scylladb#29392	2026-05-21 10:32:56 +02:00
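The cache behavior described in the commit above (record after forwarding, validate at use time, evict entries for vanished groups) can be sketched as follows. This is a simplified illustration, not the groups_manager code: `LeaderCache` and its method names are hypothetical, and nodes and group ids are plain hashable values.

```python
class LeaderCache:
    """Toy leader-location cache keyed by raft group id (writes only)."""

    def __init__(self):
        self._leaders = {}

    def record(self, group_id, node):
        # Called once the final node that handled a forwarded write is known.
        self._leaders[group_id] = node

    def pick_target(self, group_id, replicas, closest_replica):
        # Use-time validation: a cached leader that is no longer a replica
        # is evicted and the normal closest-replica path is taken.
        leader = self._leaders.get(group_id)
        if leader is not None:
            if leader in replicas:
                return leader
            del self._leaders[group_id]
        return closest_replica

    def retain_groups(self, live_group_ids):
        # On topology updates, drop entries for groups that no longer exist;
        # entries for surviving groups are kept as they are.
        live = set(live_group_ids)
        self._leaders = {g: n for g, n in self._leaders.items() if g in live}
```

Validating at use time rather than on every topology change keeps updates cheap while still breaking redirect loops between stale non-replicas.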
Botond Dénes	636e2877e2	tree: move away from collection_mutation_description Use collection_mutation_writer instead. Add to_managed_bytes() to cql3::raw_value to help avoid some copies. A special note for sstables/kl/reader.cc: this conversion is not straightforward, so we accumulate a list of cells and feed them to the writer at the end. This is sub-optimal but this code is rarely used, so it is best to be conservative.	2026-05-21 10:23:29 +03:00
Botond Dénes	24fdfa34dd	mutation/collection_mutation: collection_mutation(): remove unused abstract_type param	2026-05-21 08:34:21 +03:00
Dawid Pawlik	4c2ce1928c	types/vector: avoid unnecessary copies during vector reserialization When reserialize_value() is called on a vector type (which happens only when the vector's element type contains sets or maps), the old code materialized all elements via split_fragmented() into a std::vector<managed_bytes>, then iterated them calling reserialize_value() on each — discarding the intermediate copy. Use split_fragmented_view() to obtain zero-copy views of elements, and pass those directly to reserialize_value(). This avoids one managed_bytes allocation per element. Additionally, wrap the call with with_simplified() so that when the input is a single contiguous fragment (the common case), the compiler receives a single_fragmented_view and can eliminate fragment-boundary checks at compile time. Also generalize build_value_fragmented() to accept any forward range of FragmentedView elements (not just managed_bytes), and write directly into the output buffer via with_linearized instead of going through an intermediate read_simple_bytes copy. This benefits all callers including evaluate_vector() on the INSERT path for vector<float, N>. The with_simplified() dispatch instantiates reserialize_value with single_fragmented_view, which in turn instantiates partially_deserialize_listlike and partially_deserialize_map with that type. Add explicit template instantiations in types/types.cc since those function templates are defined there and only previously instantiated for managed_bytes_view and fragmented_temporary_buffer::view. Note: the reserialization path is only exercised for vectors whose element type contains sets or maps (e.g. vector<frozen<map<int,int>>, N>). The common vector<float, N> case never enters reserialize_value() because bound_value_needs_to_be_reserialized() returns false at the call site. However, the build_value_fragmented() improvement applies to all vector INSERTs. 
References: SCYLLADB-471 Fixes: SCYLLADB-1799 Closes scylladb/scylladb#28559	2026-05-20 12:22:19 +03:00
Dawid Pawlik	232b1a3725	cql3: generalize viewless index handling in CREATE INDEX statement Replace the `vector_index`-specific checks in `create_index_statement` with a generic `is_viewless_custom_class()` helper that queries the index factory to determine whether an index type creates a backing materialized view. This covers both existing (`vector_index`) and new (`fulltext_index`) viewless index types: - Reject view properties (WITH clause) for any viewless index - Use name-based duplicate detection for named viewless indexes, since they have no backing view table for `has_schema()` to find (issue #26672)	2026-05-19 08:52:47 +02:00
Dawid Pawlik	9e02e11ea8	fulltext_index: enforce CDC requirements for fulltext indexes Fulltext indexes rely on CDC to track changes for asynchronous index building. Enforce the following CDC constraints during CREATE INDEX: - CDC TTL must be at least 86400 seconds (24 hours) - CDC delta mode must be 'full' or postimage must be enabled Add `has_fulltext_index()` and `check_cdc_options()` so that other modules can detect fulltext indexes and validate CDC settings: - include fulltext indexes in `cdc_enabled()` so the CDC log is auto-created, and validate CDC options in `on_before_update_column_family()` - block `ALTER TABLE ... WITH cdc = {'enabled': false}` when a fulltext index exists on the table	2026-05-19 08:52:47 +02:00
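The two CDC constraints listed in the commit above can be written down as a small validation function. This is a sketch only: `fulltext_cdc_problems` is a hypothetical name (the commit's own `check_cdc_options()` may have a different signature and behavior), the parameters are assumed to be already-parsed CDC options, and any special TTL values are not handled.

```python
MIN_CDC_TTL_SECONDS = 86400  # 24 hours, as stated in the commit message

def fulltext_cdc_problems(ttl_seconds, delta_mode, postimage):
    """Return a list of violated constraints (empty if the options are fine)."""
    problems = []
    if ttl_seconds < MIN_CDC_TTL_SECONDS:
        problems.append(
            f"CDC TTL must be at least {MIN_CDC_TTL_SECONDS} seconds, got {ttl_seconds}"
        )
    if delta_mode != "full" and not postimage:
        problems.append("CDC delta mode must be 'full' or postimage must be enabled")
    return problems
```

Returning all violations at once, rather than failing on the first, is a design choice of this sketch so that a single error message can list everything that needs fixing.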


4242 Commits