scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Nadav Har'El	21ecc12fc6	Merge 'index: fix local vector index locality detection after schema reload' from Michał Hudobski After schema reload, `target_parser::is_local()` did not recognize the vector-index local target format `{"pk": [...], "tc": "..."}`, causing local vector indexes to be treated as global. This broke duplicate detection when both a global and a local vector index existed on the same column. Fix by introducing `vector_index::is_local()` and dispatching to it from `create_index_from_index_row()` based on the index class. Also adds tests for local/global vector index coexistence. Fixes: SCYLLADB-987 backport reasoning: we added local vector index support in 2026.1 Closes scylladb/scylladb#29492 * github.com:scylladb/scylladb: test/cqlpy: add tests for global and local vector index coexistence index: fix local vector index locality detection after schema reload	2026-05-27 15:34:57 +03:00
Wojciech Mitros	ae0d77257f	mv: fix view_update_builder losing fragments across batch boundaries When a mutation generates more view updates than max_rows_for_view_updates (100), view_update_builder::build_some() splits the work into multiple batches. There was a bug in how fragments were read between batches: When should_stop_updates() returned true, the old code called stop() which returned stop_iteration::yes without reading the next fragments. On the next build_some() call, read_both_next_fragments() was called at the start, which advanced BOTH readers - skipping any fragment that was already read but not yet consumed. A row could be left unconsumed if either: - the 100th (last in the batch) update was a row insertion and we still had insertions/updates remaining - the 100th (last in the batch) update was a row deletion and we still had deletions/updates remaining For the most common case where work is split in batches, i.e. range deletions, we couldn't hit this because range delete generates only view row deletions. On tables with a single materialized view, we also couldn't get this for any batches with less than 50 statements (unless the batch also contained range deletions), because one non-range-delete update can generate up to 2 view updates. However, for a range of scenarios outside these 2, we could lose view updates, resulting in persistent inconsistencies. The fix: - read_*_next_fragment() now accept a stop_iteration parameter, so the next fragments are always read after consuming (even when stopping), but stop_iteration::yes is correctly propagated to break the loop. - build_some() no longer re-reads fragments at the start. Instead, an initialize() method performs the initial read once at construction. - because now we only advance readers after consuming, we won't advance readers after end_of_partition, so we extend the break condition to accept either reader evaluating to `false` or being at the end_of_partition. 
We also handle the optimization with _skip_row_updates Fixes: scylladb/scylladb#29155 Closes scylladb/scylladb#29498	2026-05-26 14:15:12 +02:00
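The batch-boundary bug described in the commit above can be illustrated with a toy model. This is a minimal sketch, not the actual `view_update_builder`: the class name, the integer "fragments", and the `fixed` switch are all hypothetical, and each consumed fragment stands in for exactly one view update.

```python
class ViewUpdateBuilderSketch:
    """Toy merge of two sorted readers, emitting at most max_rows keys per batch."""

    def __init__(self, updates, existings, max_rows, fixed):
        self._updates = iter(updates)
        self._existings = iter(existings)
        self._max_rows = max_rows
        self._fixed = fixed
        self._update = None
        self._existing = None
        if fixed:
            # Fixed behavior: read both readers once, at construction.
            self._read_both()

    def _read_both(self):
        self._update = next(self._updates, None)
        self._existing = next(self._existings, None)

    def build_some(self):
        if not self._fixed:
            # Old behavior: advance BOTH readers at the start of every batch,
            # which drops a fragment that was read but not yet consumed.
            self._read_both()
        batch = []
        while self._update is not None or self._existing is not None:
            u, e = self._update, self._existing
            take_u = u is not None and (e is None or u <= e)
            take_e = e is not None and (u is None or e <= u)
            batch.append(u if take_u else e)
            stop = len(batch) >= self._max_rows
            if stop and not self._fixed:
                break  # old behavior: stop without reading the next fragment
            # Advance only the reader(s) whose fragment was consumed.
            if take_u:
                self._update = next(self._updates, None)
            if take_e:
                self._existing = next(self._existings, None)
            if stop:
                break
        return batch


def drain(builder):
    """Call build_some() until it returns an empty batch."""
    out = []
    while batch := builder.build_some():
        out.extend(batch)
    return out


fixed = drain(ViewUpdateBuilderSketch([1, 3, 5], [2, 4, 6], max_rows=2, fixed=True))
buggy = drain(ViewUpdateBuilderSketch([1, 3, 5], [2, 4, 6], max_rows=2, fixed=False))
```

In this model the old behavior loses keys 3 and 6: each was already read from its reader when the batch limit was hit, and was then skipped by the re-read at the start of the next batch.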
Nadav Har'El	f65a52f3ec	Merge 'vector_search: test: migrate rescoring tests from C++/Boost to pytest' from Szymon Malewski Migrate mock-based rescoring and oversampling tests from test/vector_search/rescoring_test.cc to pytest and delete the C++ file. Index option validation tests go to test_vector_index.py; rescoring tests go to a new test_vector_search_rescoring.py which introduces shared infrastructure (EmbeddingRow dataclass, TEST_DATA dict, reversed_ann_response() helper, rescoring_test_table() context manager). Two tests have updated assertions (semantic change): filters_invalid_similarity_scores now uses per-function expected result sets including a zero-vector row, and rescoring_with_zerovector_query asserts empty results after NaN filtering (cosine only). Both are marked xfail pending SCYLLADB-924. Follow-up to #29593. Does not require backport - simple refactoring of tests Closes scylladb/scylladb#29906 * github.com:scylladb/scylladb: test/vector_search: migrate zero-vector query rescoring test to pytest; delete rescoring_test.cc test/vector_search: migrate invalid similarity score filtering test to pytest test/vector_search: migrate non-ANN similarity argument rescoring test to pytest test/vector_search: migrate wildcard select rescoring test to pytest test/vector_search: migrate similarity_function rescoring test to pytest test/vector_search: migrate rescoring and f32 quantization tests to pytest test/vector_search: migrate oversampling tests to pytest test/vector_search: migrate vector_index option validation tests to pytest	2026-05-26 09:45:40 +03:00
Szymon Malewski	2151a4fac3	test/vector_search: migrate zero-vector query rescoring test to pytest; delete rescoring_test.cc Migrate rescoring_with_zerovector_query from rescoring_test.cc to pytest as test_rescoring_with_zerovector_query. Tested with cosine similarity only because zero vectors produce NaN only for cosine; other functions yield valid scores. The test is marked xfail: similarity_cosine now returns NaN for zero vectors (SCYLLADB-456 fix) and rescoring should filter out NaN scores, yielding an empty result set. Semantic change: the test now asserts the desired empty-result behavior instead of asserting that the query does not throw. Delete rescoring_test.cc now that all tests have been migrated and remove its entries from configure.py and test/vector_search/CMakeLists.txt.	2026-05-26 00:37:54 +02:00
Szymon Malewski	533a8e65fe	test/vector_search: migrate invalid similarity score filtering test to pytest Migrate no_nulls_in_rescored_results from rescoring_test.cc to pytest, renamed to test_filters_invalid_similarity_scores_in_rescored_results. The test now also inserts a zero-vector row (id=14) to cover the case introduced when similarity_cosine was changed to return NaN for zero vectors instead of throwing (SCYLLADB-456). The expected surviving set of rows is refined per similarity function based on which inputs produce valid (non-NaN, non-Infinity) similarity scores. Marked xfail because rescoring does not yet filter rows with invalid scores. Semantic change: the expected surviving row set is updated per the behavior described above.	2026-05-26 00:37:54 +02:00
Szymon Malewski	63d9b7445f	test/vector_search: migrate non-ANN similarity argument rescoring test to pytest Migrate select_similarity_function_other_than_ann_ordering from rescoring_test.cc to pytest. The test verifies that similarity scores in SELECT are computed against the explicitly supplied argument vector rather than the ANN ordering vector. No semantic change.	2026-05-26 00:37:54 +02:00
Szymon Malewski	0cb557695a	test/vector_search: migrate wildcard select rescoring test to pytest Migrate wildcard_select_is_correctly_rescored from rescoring_test.cc to pytest. The test verifies that SELECT * with rescoring returns rows in the correct similarity order with correct embedding values, covering a slightly different processing path from the explicit-column SELECT test. No semantic change.	2026-05-26 00:37:53 +02:00
Szymon Malewski	cae816a8c6	test/vector_search: migrate similarity_function rescoring test to pytest Migrate similarity_function_returns_correctly_rescored_results from rescoring_test.cc to pytest. The test verifies that similarity scores in the SELECT clause are computed correctly after rescoring, for both argument orderings of the similarity function. No semantic change.	2026-05-26 00:37:53 +02:00
Szymon Malewski	78d72309b8	test/vector_search: migrate rescoring and f32 quantization tests to pytest Introduce shared test infrastructure in test_vector_search_rescoring.py: EmbeddingRow dataclass, TEST_DATA dict keyed by similarity function name, ANN_QUERY_VECTOR, reversed_ann_response() helper, and rescoring_test_table() context manager. Migrate result_returned_by_vector_store_is_rescored and f32_quantization_disables_rescoring from rescoring_test.cc. No semantic change.	2026-05-26 00:37:53 +02:00
Szymon Malewski	400c0dbb22	test/vector_search: migrate oversampling tests to pytest Migrate oversampling_multiplies_limit_for_vector_store_query and oversampled_vector_store_results_are_limited_to_cql_limit from rescoring_test.cc to test_vector_search_rescoring_with_mock.py. No semantic change.	2026-05-26 00:37:53 +02:00
Szymon Malewski	9f632182fb	test/vector_search: migrate vector_index option validation tests to pytest CREATE INDEX option tests for quantization, oversampling, and rescoring are moved from rescoring_test.cc to test_vector_index.py alongside the existing index option tests. These tests exercise only option parsing and validation - no vector store mock needed. No semantic change.	2026-05-26 00:37:52 +02:00
Nadav Har'El	96dd3121e7	Merge 'cql: rewrite CassIO SAI metadata index to regular secondary index' from Szymon Wasik CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index: ```sql CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table> ON <keyspace>.<table> (ENTRIES(metadata_s)) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'; ``` ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with: ``` StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns ``` This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`). CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern. Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries: - Detection: SAI class name + single `ENTRIES` target on a non-frozen `map` column - Rewrite: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries) - Warning: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication. 
After this change, the CassIO schema setup succeeds on ScyllaDB: - `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index - The index is functional and can accelerate metadata filtering queries - A CQL warning makes the rewrite transparent to operators - SAI on non-vector, non-map-entries columns is still rejected as before - Vector SAI indexes continue to be rewritten to `vector_index` as before - `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name) - `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias - `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped). Fixes: SCYLLADB-2113 Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added. Closes scylladb/scylladb#29981 * github.com:scylladb/scylladb: cql: rewrite CassIO SAI metadata index to regular secondary index db/config: add enable_cassio_compatibility flag	2026-05-26 00:19:03 +03:00
Michał Hudobski	1d17d2144f	index, vector_index: limit primary key columns to 255 The vector-store's InvariantKey type supports at most 255 key components. Reject vector index creation when the base table's primary key (partition + clustering columns) exceeds this limit. Fixes: VECTOR-553 Closes scylladb/scylladb#29317	2026-05-25 19:24:17 +03:00
Szymon Wasik	5ee339b11d	cql: rewrite CassIO SAI metadata index to regular secondary index When CassIO creates a SAI ENTRIES index on a map column, ScyllaDB now rewrites it to a regular secondary index and emits a CQL warning. This allows LangChain/CassIO applications to work without DDL errors. The rewrite is gated behind the enable_cassio_compatibility flag (disabled by default). Refs: SCYLLADB-2113	2026-05-25 15:11:43 +02:00
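The detection rule from the two CassIO commits above (SAI class name, a single `ENTRIES` target, a non-frozen `map` column, gated by `enable_cassio_compatibility`) can be sketched as a predicate. This is an illustration only: `Target`, `Column`, and `is_cassio_metadata_index` are hypothetical names, and matching the class name by exact string comparison is an assumption.

```python
from dataclasses import dataclass

# Class names quoted in the commit message; exact-match comparison is an assumption.
SAI_CLASS_NAMES = {
    "org.apache.cassandra.index.sai.StorageAttachedIndex",
    "sai",
}


@dataclass
class Target:  # hypothetical stand-in for an index target
    column: str
    kind: str  # e.g. "ENTRIES", "KEYS", "VALUES"


@dataclass
class Column:  # hypothetical stand-in for a column definition
    cql_type: str  # e.g. "map", "vector", "text"
    frozen: bool = False


def is_cassio_metadata_index(custom_class, targets, columns, enable_cassio_compatibility):
    """Return True if the CREATE CUSTOM INDEX statement matches the CassIO pattern."""
    if not enable_cassio_compatibility:
        return False
    if custom_class not in SAI_CLASS_NAMES:
        return False
    if len(targets) != 1 or targets[0].kind != "ENTRIES":
        return False
    col = columns.get(targets[0].column)
    return col is not None and col.cql_type == "map" and not col.frozen


columns = {"metadata_s": Column("map"), "body": Column("text")}
matches = is_cassio_metadata_index("sai", [Target("metadata_s", "ENTRIES")], columns, True)
```

When the predicate holds, the commit clears the custom class so the statement continues through the standard secondary index path; anything else keeps the existing handling.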
Dmitry Kropachev	74fa423271	transport: report host id in SUPPORTED Currently the driver builds the network layout (node IP addresses and ports) from `system.local`, `system.peers`, `system.client_routes` and then assumes that this layout is correct, without checking it. If, for example, the node IP/port (say, behind a proxy) does not match what the driver calculated, the mismatch goes unnoticed. The goal of this feature is to provide the host id to the driver in the SUPPORTED frame, so that it knows which node it connected to and can decide whether to keep the connection or drop it. - add `SCYLLA_HOST_ID` to the CQL `SUPPORTED` response - add a regression test that hooks the Python driver handshake and verifies the reported host id - `python3.12 -m py_compile test/cqlpy/test_protocol_exceptions.py` - syntax-only compile of `transport/server.cc` with the repo toolchain flags inside `dbuild` Refs #27452 Refs https://scylladb.atlassian.net/browse/DRIVER-610 Closes scylladb/scylladb#29809	2026-05-25 14:36:53 +03:00
Nadav Har'El	f8aaeb5e87	cql: atomic add/subtract operations with LWT ScyllaDB has special counter columns for which atomic add/subtract operations like `SET a = a + 1` are allowed. Such operations have not been allowed on ordinary non-counter columns, as they would not be properly atomic - the read and the write are separate, and concurrent operations can have incorrect results. This patch makes it allowed to use such atomic add/subtract operations in LWT statements. Some examples: UPDATE ... SET a = a - 1 IF a > 0 UPDATE ... SET a = a + 1 IF EXISTS UPDATE ... SET a = a + 1 IF a != NULL The row updated in the operation, and the updated column (a) should be initialized before the update - arithmetic operations on missing column values silently leave the column null (no error is generated). These add/subtract operations are allowed on any numeric column - integer or floating point of any size. The ability of LWT to fetch the old values of a column and use it to calculate the new value has long been available in our internal CAS implementation - and has been in use for years in Alternator - but until this patch it was not exposed in CQL's LWT. This patch does not add new syntax to CQL - the "SET a = a + b" and "SET a = a - b" syntax that already existed for counters is now allowed for non-counters. This is a new Scylla-only feature that does not exist in Cassandra. Fixes #10568 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-25 10:09:11 +03:00
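The semantics described in the LWT commit above (the condition is checked and the add is applied as one step, and arithmetic on a null column leaves it null) can be modeled in a few lines. This is a single-threaded sketch of the behavior as the message describes it, not ScyllaDB code; `lwt_add` and the dict-based table are hypothetical.

```python
def lwt_add(table, key, column, delta, condition):
    """Model `UPDATE ... SET column = column + delta IF <condition>`.

    Returns True if the statement was applied.
    """
    row = table.get(key)  # None models a missing row
    if not condition(row):
        return False  # condition failed: nothing is written
    if row is not None and row.get(column) is not None:
        row[column] += delta
    # A missing row or a null column silently stays null, with no error.
    return True


table = {1: {"a": 2}, 2: {"a": None}}


def positive(row):
    # Models `IF a > 0`; a null value never satisfies the comparison.
    return row is not None and row["a"] is not None and row["a"] > 0


def exists(row):
    # Models `IF EXISTS`.
    return row is not None


results = [lwt_add(table, 1, "a", -1, positive) for _ in range(3)]
applied_on_null = lwt_add(table, 2, "a", 1, exists)
```

The first two decrements apply and the third is rejected once `a` reaches 0, which is the "decrement but never below zero" pattern from the commit's first example.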
Yaniv Michael Kaul	bb69ae5a02	test: assert ALTER TYPE RENAME rejected on frozen PK UDTs Add assertion that ALTER TYPE RENAME is rejected when the UDT is used as a frozen partition key column. The existing test only covered ALTER TYPE ADD. This closes the coverage gap from dtest udtencoding_test.py::test_udt_change_in_partition_key, enabling its removal. Refs: SCYLLADB-1929 Closes scylladb/scylladb#29840	2026-05-22 12:29:43 +02:00
Dario Mirovic	f9e8518776	cql: fix request-side custom payload parsing When a CQL client sends a request with the CUSTOM_PAYLOAD flag (0x04) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as "truncated frame: expected 65546 bytes". Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable QueryHandler. Cassandra's default QueryHandler also discards them. Fixes SCYLLADB-745	2026-05-21 18:36:37 +02:00
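Skipping the custom payload described in the commit above amounts to walking a `[bytes map]` in the CQL native protocol encoding: a `[short]` entry count, then for each entry a `[string]` key (`[short]` length plus UTF-8 bytes) and a `[bytes]` value (`[int]` length plus data, where a negative length means null). The following is a minimal sketch of that walk; `skip_bytes_map` is a hypothetical helper, not the function in `transport/server.cc`.

```python
import struct

CUSTOM_PAYLOAD_FLAG = 0x04  # frame header flag named in the commit message


def skip_bytes_map(body: bytes, offset: int = 0) -> int:
    """Return the offset of the first byte after a [bytes map] starting at offset."""
    (count,) = struct.unpack_from(">H", body, offset)  # [short] number of entries
    offset += 2
    for _ in range(count):
        (key_len,) = struct.unpack_from(">H", body, offset)  # [string] key
        offset += 2 + key_len
        (value_len,) = struct.unpack_from(">i", body, offset)  # [bytes] value
        offset += 4
        if value_len > 0:  # a negative length encodes a null value
            offset += value_len
        if offset > len(body):
            raise ValueError("truncated custom payload")
    return offset


payload = (
    struct.pack(">H", 2)
    + struct.pack(">H", 3) + b"key" + struct.pack(">i", 2) + b"\x01\x02"
    + struct.pack(">H", 1) + b"n" + struct.pack(">i", -1)
)
body = payload + b"MESSAGE"
flags = 0x04
message = body[skip_bytes_map(body):] if flags & CUSTOM_PAYLOAD_FLAG else body
```

The payload contents are never interpreted, matching the commit's statement that they are discarded.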
Dario Mirovic	8e6d2d0631	test/cqlpy: add tests for request-side custom payload handling Add tests that verify Scylla's handling of CQL native protocol requests with the CUSTOM_PAYLOAD flag (0x04) set. Each test asserts the specific parse error that the unfixed server produces. A separate CQL session is used for each test. The protocol error kills the driver connection, and we need to catch it properly. Refs SCYLLADB-745	2026-05-21 18:34:43 +02:00
Michał Hudobski	119ef942f8	test/cqlpy: add tests for global and local vector index coexistence Add integration tests verifying that both a global and a local vector index can be created on the same column without triggering a spurious "duplicate custom index" error. This was fixed by #29407. Tests cover: - Creating global+local and local+global index pairs on the same column. - Duplicate detection still rejects a second index of the same locality. - IF NOT EXISTS is a no-op for a duplicate same-locality index (and verifies no extra index is created). - IF NOT EXISTS with a different locality creates both indexes. - Two indexes with the same name on different tables are rejected (partially validates VECTOR-643). Fixes: SCYLLADB-987	2026-05-21 10:35:48 +02:00
Dawid Pawlik	6387c61506	test/cqlpy: add duplicate and view tests for fulltext index Verify that fulltext indexes, which have no backing materialized view, correctly reject duplicate index creation and respect IF NOT EXISTS semantics. Named indexes must not be created twice under the same name; unnamed indexes on the same column must be detected as duplicates. IF NOT EXISTS must silently succeed rather than create a second index, including the known edge cases where the same name is reused across different tables or columns in the same keyspace (VECTOR-641).	2026-05-19 08:52:47 +02:00
Dawid Pawlik	232b1a3725	cql3: generalize viewless index handling in CREATE INDEX statement Replace the `vector_index`-specific checks in `create_index_statement` with a generic `is_viewless_custom_class()` helper that queries the index factory to determine whether an index type creates a backing materialized view. This covers both existing (`vector_index`) and new (`fulltext_index`) viewless index types: - Reject view properties (WITH clause) for any viewless index - Use name-based duplicate detection for named viewless indexes, since they have no backing view table for `has_schema()` to find (issue #26672)	2026-05-19 08:52:47 +02:00
Dawid Pawlik	215a1e3f00	test/cqlpy: add CDC validation tests for fulltext index Verify that fulltext index creation and ALTER TABLE enforce the CDC requirements: creation is rejected when TTL is below the 24-hour minimum, or when the delta mode is neither 'full' nor compensated by postimage. Also verify that enabling postimage or full delta mode allows index creation to succeed, that DROP INDEX works, and that ALTER TABLE cannot disable CDC while a fulltext index is present.	2026-05-19 08:52:47 +02:00
Dawid Pawlik	558de64773	test/cqlpy: add tablet requirement test for fulltext index Add `test_create_fulltext_index_requires_tablets` to verify that creating a fulltext index on a keyspace with tablets disabled is rejected.	2026-05-19 08:52:47 +02:00
Dawid Pawlik	69dc62c373	fulltext_index: require tablet storage for fulltext indexes Fulltext indexes, like vector indexes, require the base table's keyspace to use tablets. Add `check_uses_tablets()` validation to `fulltext_index::validate()` that rejects index creation when the keyspace does not use tablet storage. Also add `skip_without_tablets` fixture to all existing fulltext index tests so they are skipped in environments where tablets are not available.	2026-05-19 08:52:47 +02:00
Dawid Pawlik	61d658106a	index: introduce `external_index` base class for VS/FTS indexes Add `external_index` as a common base for `vector_index` and `fulltext_index`, both of which are backed by an external Vector Store engine and share CDC requirements.	2026-05-19 08:52:47 +02:00
Dawid Pawlik	c2d27d1a50	index: remove Chinese, Japanese, and Korean language analyzers Remove "chinese", "japanese", and "korean" from the list of accepted full-text search analyzer options. Exposing these options commits ScyllaDB to supporting them long-term — if we ever switch from one backend search engine to another, CJK analyzers are the most likely to lose out-of-the-box support, unlike the popular European languages that are broadly available across text analysis libraries. Restrict the accepted set now, while FTS is still new, to avoid a future compatibility burden. Add a test to check if the CJK language analyzer options are rejected. Fixes: VECTOR-672 Closes scylladb/scylladb#29877	2026-05-18 18:20:47 +03:00
Szymon Malewski	15493872b2	vector_search: fix decimal/varint precision loss in filter value_to_json() value_to_json() converts CQL values to JSON for vector search filters. For decimal and varint types, it used rjson::parse() on the JSON string, which parses through a double and silently loses precision for values exceeding ~15 significant digits — producing wrong filter results. Additionally, for decimal type we need an exact string representation that preserves the original (unscaled, scale) pair, because partition keys use byte-level identity: different serialized representations of the same numeric value are distinct rows, so the filter must reproduce the exact representation stored in the key. Add big_decimal::to_string_canonical() which follows the Java BigDecimal toString() spec (JDK 8+), producing a bijective string representation that uses exponential notation for extreme scales instead of expanding trailing zeros (which could cause OOM). This could replace to_string(), but doing so has wider consequences (e.g. hash/equality contract for decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for decimal_type, and use rjson::from_string() for varint_type, both bypassing the lossy double parse path. Tests cover the new to_string_canonical() and the filter fix, as well as existing decimal type behavior (key representation, clustering order, toJson) that we rely on and must not break. The CQL decimal type tests (test_type_decimal.py) also pass against Cassandra. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1583 Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574 Closes scylladb/scylladb#29505	2026-05-18 17:07:26 +03:00
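Both problems named in the commit above can be reproduced with Python's standard library: a JSON number parsed through a double loses digits, and two decimals with the same numeric value can carry different (unscaled, scale) pairs. This is an analogy using Python's `json` and `decimal` modules, not rjson or `big_decimal`; `unscaled_and_scale` is a hypothetical helper.

```python
import json
from decimal import Decimal

text = "12345678901234567890.123456789"

# Default JSON parsing goes through a binary double and drops digits.
via_double = json.loads(text)
# Parsing the same text as an exact decimal keeps every digit.
exact = json.loads(text, parse_float=Decimal)


def unscaled_and_scale(d: Decimal):
    """Return the (unscaled, scale) pair, with value = unscaled * 10**-scale."""
    sign, digits, exponent = d.as_tuple()
    unscaled = int("".join(map(str, digits)))
    return (-unscaled if sign else unscaled, -exponent)


# Numerically equal, but distinct representations (and so distinct key bytes).
a, b = Decimal("1.0"), Decimal("1.00")
```

Here `a == b` holds while `unscaled_and_scale(a)` and `unscaled_and_scale(b)` differ, which is why a filter on a decimal key has to reproduce the stored representation and not just the numeric value.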
Evgeniy Naydanov	39a10d6d67	test: remove dead suite subclasses and legacy execution pipeline After all test suites migrated to test_config.yaml with type: Python, the specialized suite classes (Topology, CQLApproval, Run, Tool) and the legacy execution pipeline (find_tests, run_test, TestSuite.run, Test.run) became unreachable. Remove all this dead code. Deleted files: - suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py Simplified: - base.py: remove run_test(), read_log(), TestSuite.run(), add_test_list(), build_test_list(), all_tests(), test_count(), SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead Test attributes (args, core_args, valid_exit_codes, allure_dir, is_flaky, is_cancelled, etc.) - python.py: remove PythonTestSuite.run(), PythonTest.run(), _prepare_pytest_params(), pattern, test_file_ext, xmlout, server_log, scylla_env setup, and shlex import. Simplify run_ctx() to take no parameters. - runner.py: remove --scylla-log-filename option, print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import, and suite.yaml probe in TestSuiteConfig.from_pytest_node(). - __init__.py: remove re-exports of deleted classes. - test_config.yaml: Topology -> Python, Approval -> Python. - conftest files: run_ctx(options=...) -> run_ctx(). - docs/dev/testing.md: update to reflect current pytest-based architecture, log paths, and removed features. Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com> Closes scylladb/scylladb#29613	2026-05-17 22:16:31 +03:00
Marcin Maliszkiewicz	ec8f8e3a5b	Merge 'test: make test_vector_search_with_vector_store_mock 30 times faster!' from Nadav Har'El Before this patch, ``` test/cqlpy/run test_vector_search_with_vector_store_mock.py ``` Took 34 seconds. After this patch, it takes 1 second. Look at the individual patches for how the magic happened. The first patch lowers the test duration from 34 to 5 seconds, the second patch lowers it further to 1 second. Closes scylladb/scylladb#29891 * github.com:scylladb/scylladb: test/cqlpy: make test_vector_search_with_vector_store_mock faster vector-search: reset DNS timeout after changing host	2026-05-14 17:12:47 +02:00
Botond Dénes	1403f18240	Merge 'alternator: add more vector search features' from Nadav Har'El Recently (in commit `37fc1507f0`) we added vector search support for Alternator. That implementation was functional, but did not yet support all the features that we had envisioned. This patch series adds some of the missing features to Alternator's vector search. Each feature is described in more detail in its own patch. * Metrics related to vector search usage in Alternator. * `SimilarityFunction` option when creating a vector index to choose the similarity function. Defaults to `COSINE` (the existing default). Other options are `DOT_PRODUCT` and `EUCLIDEAN`. * An optimized vector type, `{"FLOAT32VECTOR": [1.0, 2.0, ..]}`, which is stored on disk efficiently as 32-bit floats, not a JSON. * A Query VectorSearch option `ReturnScores` asking to return the similarity score calculated for each returned result (the results are sorted in decreasing similarity score - the highest similarity is the best and returned first). Closes scylladb/scylladb#29554 * github.com:scylladb/scylladb: alternator: add ReturnScores option to VectorSearch vector_store_client: read and return similarity_scores alternator: add optimized vector type for vector search alternator: add SimilarityFunction option to vector index creation alternator: add vector search metrics	2026-05-14 10:41:41 +03:00
Nadav Har'El	5c065c7746	test/cqlpy: make test_vector_search_with_vector_store_mock faster The previous patch made test_vector_search_with_vector_store_mock significantly faster, but at 5 seconds for 7 tests, it was still not fast enough. It turns out that the reason why the tests were slow is that each test used a function-scoped fixture, which set up the vector store mock again and again, separately for each test. This - especially waiting for the client in Scylla to recognize the new server - took time (before the previous patch it was 5 seconds, after the patch it went down to 0.5 seconds - but still too slow). The solution is simple: 1. Create a module scoped fixture that creates the mock and connects it to Scylla just once for all the tests in that file. 2. The function scoped fixture just uses the module-scoped one but resets the saved responses, to avoid one test influencing the other. After this patch, the time to run this test file is down to 1 second (!). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-13 14:57:56 +03:00
Nadav Har'El	c56361a6d7	vector_store_client: read and return similarity_scores The vector store returns for every ANN search, in addition to the keys of the matching items, two additional vectors - "distances" and "similarity_scores". The "distances" are raw distance metrics - lower scores are better matches, while "similarity_scores" are modified such that higher scores are better matches. Traditionally, search scores in systems like Cassandra and OpenSearch use the "similarity scores" approach (higher is better, results are returned in decreasing similarity order), so this is the more interesting vector of the two. But before this patch, our vector_store_client::ann() inspected only "distances". But... then, it didn't return even that to the caller :-) So in this patch, we: 1. Ignore "distances" and instead look at "similarity_scores", which is what users really want based on their experience with other vector and non-vector search engines. 2. Return the similarity score of each match together with the match. We already have this score (the vector store returns it) and we can add it to the existing primary_key structure of each result. So each result is a "struct primary_key" which has fields partition, clustering, and after this patch - similarity. Existing callers in CQL and Alternator vector search will ignore this "similarity" field in each result, and not notice it was added. But in the next patch, we'll allow Alternator's vector search to return this similarity in each result. The existing unit tests for vector_store_client.cc mocked vector-store responses with "distances", without "similarity_scores", so they no longer represent what we actually expect the vector store to do. So this patch also contains modifications for these tests, to mock and to test "similarity_scores" - not "distances". The more interesting tests, in the next patch, use the real vector store and check that we really do get a "similarity_scores" response from it. 
This patch also handles a small corner case for DOT_PRODUCT, which is the only unbounded similarity function. If the similarity overflows the 32-bit float, the vector store returns a JSON "null" instead of a JSON number (since JSON doesn't support infinite numbers). Our existing vector-store client code errored out when it saw this "null", which is wrong - the request should be allowed to proceed. So in this patch when we see a "null" JSON for similarity, we return +Inf. This is usually correct because the top results really have +Inf, not -Inf, but if we ask for all items we can reach those with similarity -Inf and incorrectly assign +Inf to them (we have a test for this case in the next patch). But this problem won't happen when Limit is low, and in any case it's better than aborting the request after it had already succeeded. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-13 14:19:17 +03:00
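The null handling described in the commit above can be sketched as follows. Only the `similarity_scores` field name comes from the commit message; the overall response shape and the `parse_similarity_scores` helper are assumptions made for illustration.

```python
import json
import math


def parse_similarity_scores(response_text: str):
    """Read "similarity_scores", mapping JSON null (float overflow) to +Inf."""
    doc = json.loads(response_text)
    return [math.inf if score is None else float(score)
            for score in doc["similarity_scores"]]


scores = parse_similarity_scores('{"similarity_scores": [null, 0.75, 0.5]}')
```

As the commit notes, this choice is a heuristic: a null that actually stood for -Inf would also be reported as +Inf.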
Nadav Har'El	51c35c05e2	test/cqlpy: teach run-cassandra to use Docker The test/cqlpy/run-cassandra script makes it quite easy to run test/cqlpy tests against Cassandra, which is important for checking compatibility. Unfortunately, because modern Linux distributions like Fedora do not have either Cassandra or the old version of Java that it needs, the user needs to download those manually. This is fairly easy, and explained in detail in test/cqlpy/README.md, but nevertheless is a non-trivial manual step. So this patch adds an even simpler alternative, the "--docker" option which tells the script to run the official Cassandra docker image, complete with the version of Java that it prefers - the user does not need to download or install Cassandra or Java. The image is efficiently cached by Docker, so running run-cassandra again doesn't need to download it again; Moreover, trying several different versions of Cassandra only needs to download and store the shared parts (base image and Java) once. test/cqlpy/run-cassandra --docker test_file.py::test_function Runs by default the latest Cassandra 5 release. You can also use "--docker=4" to get the latest Cassandra 4 release, "--docker=3.11" to get the latest Cassandra 3.11 patch release, or "--docker=3.11.1" to get a specific patch release. In addition to the "--docker" option, this patch also introduces a second option, "--java-docker", which takes only Java from docker, but runs your locally installed Cassandra (to which you should point with the CASSANDRA environment variable, as before). This option can be useful if your host does not have a suitable version of Java, but you want to run a locally-installed or locally-modified version of Cassandra. The "--java-docker" option defaults to getting Java 11, to use other versions you can use for example "--java-docker=17". Fixes #25826. Closes scylladb/scylladb#29860	2026-05-13 11:57:18 +02:00
Yaniv Michael Kaul	c359a09189	test: add UDF/UDA keyspace isolation and UDT tests Port 3 tests from scylla-dtest user_functions_test.py: - test_udf_with_udt: UDF taking frozen UDT arg, verifies DROP TYPE blocked - test_udf_with_udt_keyspace_isolation: cross-keyspace UDT references rejected - test_aggregate_with_udt_keyspace_isolation: cross-keyspace UDT in UDA rejected All tests use Lua (Scylla's supported UDF language). Reproduces CASSANDRA-9409. Closes scylladb/scylladb#1928 Closes scylladb/scylladb#29843	2026-05-12 14:57:14 +03:00
Piotr Smaron	1018710e38	test/cqlpy: un-xfail oversized indexed value build test Issue #8627 is fixed, so test_too_large_indexed_value_build now passes and should run normally instead of XPASSing under strict xfail. Fixes: SCYLLADB-1938 Closes scylladb/scylladb#29853	2026-05-12 11:40:53 +02:00
Botond Dénes	8d6f031a4a	schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE output incorrectly showed NullCompactionStrategy instead of the actual strategy. This happened because schema_properties() called compaction_strategy(), which returns compaction_strategy_type::null when compaction is disabled. Fix it by using configured_compaction_strategy(), which always returns the real strategy type - consistent with how schema_tables.cc serializes it to disk. Fixes SCYLLADB-1353 Closes scylladb/scylladb#29804	2026-05-12 12:38:25 +03:00
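The difference between the two accessors named in this commit can be modelled with a small Python sketch. The class below is a hypothetical stand-in for the C++ schema object, not the real implementation; only the method names `compaction_strategy()` and `configured_compaction_strategy()` come from the commit message.

```python
class SchemaModel:
    """Toy model of a table schema (hypothetical, for illustration only)."""

    def __init__(self, strategy: str, enabled: bool = True):
        self._strategy = strategy
        self._enabled = enabled

    def compaction_strategy(self) -> str:
        # Effective strategy: reports the null strategy when compaction
        # is disabled via 'enabled': 'false'.
        return self._strategy if self._enabled else "NullCompactionStrategy"

    def configured_compaction_strategy(self) -> str:
        # Strategy the user configured, regardless of the enabled flag.
        return self._strategy


schema = SchemaModel("IncrementalCompactionStrategy", enabled=False)
print(schema.compaction_strategy())
print(schema.configured_compaction_strategy())
```

In this model, DESCRIBE output built from the second accessor keeps showing the configured strategy even while compaction is disabled.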
Pavel Emelyanov	150345cc52	Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions. New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully. A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion. A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations. Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness. Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. 
Old class names are preserved as aliases for backward compatibility.

| Test Name | Execution time with test-specific retry strategy (ms) | Original execution time (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | 3.2× |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | 3.2× |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | 2.0× |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | 2.4× |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | 2.6× |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |

No backport needed - this is a test infrastructure improvement with no production code impact beyond 
the new `s3::client` methods. Closes scylladb/scylladb#29508 * github.com:scylladb/scylladb: test: extract object storage helpers to test/pylib/object_storage.py test: add per-test bucket isolation to object_store fixtures s3: add client::make overload with custom retry strategy test: add s3_test_fixture and migrate tests to per-bucket isolation s3: add create_bucket and delete_bucket to client	2026-05-12 12:38:24 +03:00
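The per-test bucket naming described in this series (a name sanitized from the pytest node name, plus a short UUID suffix) might look roughly like the sketch below. The helper name `make_bucket_name`, the 8-character suffix, and the exact character rules are assumptions for illustration; the real logic lives behind `create_test_bucket()` and may differ.

```python
import re
import uuid


def make_bucket_name(node_name: str, max_len: int = 63) -> str:
    """Derive a unique, bucket-safe name from a pytest node name.

    Hypothetical helper: keeps lowercase letters, digits and hyphens,
    and stays within the 63-character bucket name limit.
    """
    # Collapse every run of unsupported characters into one hyphen.
    base = re.sub(r"[^a-z0-9-]+", "-", node_name.lower()).strip("-")
    suffix = uuid.uuid4().hex[:8]  # short random suffix for uniqueness
    # Leave room for the "-" separator and the suffix.
    base = base[: max_len - len(suffix) - 1].rstrip("-")
    return f"{base}-{suffix}" if base else f"test-{suffix}"


print(make_bucket_name("test_s3.py::test_upload[param]"))
```

Because the suffix is random, two processes running the same test get different bucket names, which is what allows concurrent `test.py` runs without collisions.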
Piotr Smaron	71542206bc	cql: return InvalidRequest for oversized partition/clustering keys When a partition key or clustering key value exceeds the 64 KiB limit (65535 bytes serialized), Scylla used to raise a generic std::runtime_error "Key size too large: N > M" from the low-level compound-key serializer. That error surfaced to clients as a CQL server error (code 0x0000, "NoHostAvailable"-looking), which is both ugly and incompatible with Cassandra - Cassandra returns a clean InvalidRequest with the message "Key length of N is longer than maximum of M". Fix this at the single chokepoint: compound_type::serialize_value in keys/compound.hh. The serializer is on every path that materializes a key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and SELECT builds partition and clustering ranges through it - so a single throw replacement produces a clean InvalidRequest consistently across all paths and all key shapes (single, compound PK, composite CK). The previous approach on this PR branch patched three call sites in cql3/restrictions/statement_restrictions.cc, which only covered SELECT, duplicated the check, and placed it mid-restrictions code (flagged in review). Dropping those changes in favour of the root-cause fix here. Un-xfail the tests this fixes: - test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck, test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite, test_insert_total_compound_pk_err, test_insert_total_composite_ck_err. - test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K, testCKInsertWithValueOver64K. - test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K. test_insert_65k_pk_compound stays xfail: its oversized value gets rejected by the Python driver's CQL wire-protocol encoder (see CASSANDRA-19270) before reaching the server, so the fix can't apply. Updated its reason. 
testCKQueryWithValueOver64K stays xfail with an updated reason: Cassandra silently returns empty for an oversized clustering key in WHERE, while Scylla now throws InvalidRequest - a deliberate choice mirroring the partition-key case, documented in the discussion on #10366. Add three tight-boundary tests (addressing review feedback on the previous revision) that pin MAX+1 behaviour for SELECT and INSERT of both partition and clustering keys. Update test/cluster/dtest/limits_test.py to match the new message ("Key length of \\d+ is longer than maximum of 65535"). fixes #10366 fixes #12247 Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com> Closes scylladb/scylladb#23433	2026-05-11 16:56:35 +03:00
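The single-chokepoint check described in this commit can be sketched in Python as follows. This is a simplified model, not the C++ `compound_type::serialize_value` code: the `InvalidRequest` class and `check_key_length` helper are hypothetical, and the real serializer also deals with component framing. Only the 65535-byte limit and the error message format come from the commit message.

```python
MAX_KEY_SIZE = 65535  # limit stated in the commit message


class InvalidRequest(Exception):
    """Stand-in for the CQL InvalidRequest error (hypothetical class)."""


def check_key_length(serialized: bytes) -> bytes:
    """Reject an oversized serialized key with a clean, client-facing error."""
    if len(serialized) > MAX_KEY_SIZE:
        raise InvalidRequest(
            f"Key length of {len(serialized)} is longer than "
            f"maximum of {MAX_KEY_SIZE}"
        )
    return serialized


check_key_length(b"x" * MAX_KEY_SIZE)  # exactly at the limit: accepted
try:
    check_key_length(b"x" * (MAX_KEY_SIZE + 1))  # MAX+1: rejected
except InvalidRequest as exc:
    print(exc)
```

Placing the check in the one function every key passes through is what gives INSERT, UPDATE, DELETE, BATCH and SELECT the same error without per-statement checks.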
Piotr Smaron	959f67b345	cql: verify tuples length in multi-column IN restriction When a multi-column IN restriction contains tuples with a different number of elements than the number of restricted columns (e.g. `(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an inconsistent error message or, for over-sized tuples, an internal type-mismatch error referencing the list literal representation. Validate each tuple's arity against the number of restricted columns while building the IN restriction and raise a clear "Expected N elements in value tuple, but got M" error in both the under- and over-sized cases. Fixes #13241 Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com> Closes scylladb/scylladb#18407	2026-05-11 16:55:09 +03:00
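The arity validation described in this commit can be sketched in Python. The function `validate_in_tuples` and the `InvalidRequest` class are hypothetical illustrations, not the C++ code in cql3; the error message text is the one quoted in the commit message.

```python
class InvalidRequest(Exception):
    """Stand-in for the CQL InvalidRequest error (hypothetical class)."""


def validate_in_tuples(columns, tuples):
    """Check that every IN tuple has one element per restricted column."""
    expected = len(columns)
    for value_tuple in tuples:
        if len(value_tuple) != expected:
            # Same message for both under- and over-sized tuples.
            raise InvalidRequest(
                f"Expected {expected} elements in value tuple, "
                f"but got {len(value_tuple)}"
            )


# Mirrors the example from the commit: (b, c, d) IN ((1, 2), (2, 1, 4))
try:
    validate_in_tuples(("b", "c", "d"), [(1, 2), (2, 1, 4)])
except InvalidRequest as exc:
    print(exc)
```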
Nadav Har'El	f1b2b9bd52	Merge 'Register `fulltext_index` custom index type' from Dawid Pawlik This PR adds the `fulltext_index` custom index class, laying the groundwork for full-text search in ScyllaDB. It focuses on the CQL-facing layer - schema validation, option parsing, and metadata - without implementing the search backend itself. Users can now write: ```cql CREATE CUSTOM INDEX ON t(content) USING 'fulltext_index' WITH OPTIONS = {'analyzer': 'english', 'positions': 'false'}; ``` The implementation follows the same custom index pattern established by vector search: a `custom_index` subclass registered in the factory map, with no backing materialized view. This keeps the door open for a CDC-based indexing pipeline similar to the one vector search uses. As part of this work, the option validation helpers (`validate_enumerated_option`, `validate_positive_option`, `validate_factor_option`) were extracted from `vector_index.cc` into a shared header so both index types can reuse them. The `custom_index` base class also gained a virtual `index_type_name()` method, giving each subclass a self-describing name for error messages without hardcoding strings in shared code. The PR is split into three commits: 1. Extract shared validation utilities and add `index_type_name()` to `custom_index` 2. Implement `fulltext_index` with column type and option validation 3. Integration tests covering creation, validation, describe, and metadata Fixes: SCYLLADB-1517 Fixes: SCYLLADB-1510 References: SCYLLADB-1516 Closes scylladb/scylladb#29658 * github.com:scylladb/scylladb: test/cqlpy: add integration tests for `fulltext_index` index: unify custom index description index: add `fulltext_index` custom index implementation index: extract option validation helpers	2026-05-11 16:16:58 +03:00
Nadav Har'El	fcfad51284	Merge 'cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time' from Marcin Maliszkiewicz selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC, but never the REDUCEFUNC. The reducefunc is invoked by the distributed aggregation path in service::mapreduce_service, so a user could cause it to run server-side without holding EXECUTE on it as long as the query took the mapreduce path. Also push agg.state_reduction_function so select_statement::check_access requires EXECUTE on it too. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756 Backport: no, it's a minor fix and UDFs are experimental feature in Scylla Closes scylladb/scylladb#29717 * github.com:scylladb/scylladb: test/cqlpy: add test for EXECUTE permission on UDA sub-functions cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time	2026-05-11 16:14:38 +03:00
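The permission gap fixed here can be modelled with a short Python sketch. The `Aggregate` dataclass, `used_functions` and `check_access` below are hypothetical simplifications of the C++ code, written only to show why omitting the REDUCEFUNC from the list of used functions lets it escape the EXECUTE check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Aggregate:
    """Toy model of a UDA and its sub-functions (hypothetical)."""
    name: str
    sfunc: str
    finalfunc: Optional[str] = None
    reducefunc: Optional[str] = None


def used_functions(agg: Aggregate) -> list:
    """List every function a SELECT of this aggregate may invoke."""
    funcs = [agg.name, agg.sfunc]
    if agg.finalfunc:
        funcs.append(agg.finalfunc)
    if agg.reducefunc:
        # The fix: the reduce function must be listed as well.
        funcs.append(agg.reducefunc)
    return funcs


def check_access(agg: Aggregate, granted: set) -> None:
    """Raise if EXECUTE is missing on any used function."""
    missing = [f for f in used_functions(agg) if f not in granted]
    if missing:
        raise PermissionError(f"no EXECUTE permission on {', '.join(missing)}")


agg = Aggregate("my_avg", "accumulate", "finalize", "reduce")
try:
    check_access(agg, {"my_avg", "accumulate", "finalize"})
except PermissionError as exc:
    print(exc)
```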
Marcin Maliszkiewicz	fa9d15d31a	test/cqlpy: add test for EXECUTE permission on UDA sub-functions Verify that SELECT of a UDA requires EXECUTE on its SFUNC, FINALFUNC, and REDUCEFUNC individually. If any one permission is missing, the query must be rejected at planning time (even on an empty table). The test is parameterized over the three sub-functions and uses Lua on Scylla or Java on Cassandra, so it runs on both backends. The REDUCEFUNC case is skipped on Cassandra since REDUCEFUNC is a Scylla extension. Refs SCYLLADB-1756	2026-05-11 10:23:39 +02:00
Nadav Har'El	34136d3bc2	Merge 'vector_search: test: migrate CQL tests for vector search from C++/Boost to pytest' from Karol Nowacki Migrate vector search (ANN ordered select query) CQL tests from C++/Boost suite to pytest. This migration includes: - New pytest tests in `test/cqlpy/test_vector_search_with_vector_store_mock.py` - VectorStoreMock server as pytest fixture to simulate vector store responses The benefits of this migration are: - Extended test coverage to verify CQL protocol serialization and driver - Reduced overall test time (no compilation required for pytest) Fixes SCYLLADB-695 No backport needed as this is a refactoring. Closes scylladb/scylladb#29593 * github.com:scylladb/scylladb: vector_search: test: migrate paging warnings tests to Python vector_search: test: migrate local_vector_index to Python vector_search: test: migrate vector_index_with_additional_filtering_column to Python vector_search: test: migrate cql_error_contains_http_error_description to Python vector_search: test: migrate pk in restriction test to Python	2026-05-10 22:09:17 +03:00
Dawid Pawlik	b6d5ff344b	test/cqlpy: add integration tests for `fulltext_index` Add `test_fulltext_index.py` covering the `fulltext_index` custom index: - Creation on text, varchar, and ascii columns - Rejection of non-text types (int, blob, vector) - Validation of analyzer and positions options - Rejection of unsupported option keys - Case-insensitive class name lookup - DESCRIBE INDEX output with and without options - No backing materialized view in `system_schema.views` - IF NOT EXISTS idempotent behavior - Metadata correctness in `system_schema.indexes`	2026-05-08 11:30:08 +02:00
Yaniv Michael Kaul	7557c64f20	test/cqlpy: add tests for hyphenated column names Verify that double-quoted column names with hyphens (e.g. "my-col") work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that unquoted hyphenated names are rejected with a syntax error.	2026-05-06 11:32:04 +03:00
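A client-side helper that decides when a column name needs double quotes could look like the sketch below. `quote_identifier` is a hypothetical function, not part of the test suite or any driver; it is deliberately conservative (anything that is not a plain lowercase identifier gets quoted) and it does not handle reserved keywords.

```python
import re

# Names that are safe to use unquoted: lowercase letter first, then
# lowercase letters, digits or underscores.
_PLAIN_IDENTIFIER = re.compile(r"[a-z][a-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Return the name as it should appear in a CQL statement."""
    if _PLAIN_IDENTIFIER.fullmatch(name):
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


col = quote_identifier("my-col")
print(f"CREATE TABLE ks.t (pk int PRIMARY KEY, {col} int)")
```

The unquoted form `my-col` is what the second half of the commit expects to be rejected with a syntax error.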
Karol Nowacki	20b953ef8c	vector_search: test: migrate paging warnings tests to Python Move the paging warning related tests from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock.	2026-05-05 18:23:30 +02:00
Karol Nowacki	84787ce6a5	vector_search: test: migrate local_vector_index to Python Move the local vector index test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a local vector index on ((pk1, pk2), embedding) and verifies that SELECT with partition key restriction and ANN ordering works correctly.	2026-05-05 18:23:30 +02:00
Karol Nowacki	0bb7e47090	vector_search: test: migrate vector_index_with_additional_filtering_column to Python Move the SCYLLADB-635 regression test from C++ vector_store_client_test to Python test_vector_search_with_vector_store_mock. The test creates a vector index on (embedding, ck1) and verifies that SELECT with ANN ordering works correctly when additional filtering columns are included in the index definition.	2026-05-05 18:23:30 +02:00
Karol Nowacki	5a8af3c727	vector_search: test: migrate cql_error_contains_http_error_description to Python Move the test that verifies HTTP error descriptions from the vector store are propagated through CQL InvalidRequest messages from the C++ vector_store_client_test to the Python test_vector_search_with_vector_store_mock. The test configures the mock to return HTTP 404 with 'index does not exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.	2026-05-05 18:23:30 +02:00


438 Commits