This series adds IDL file comparison to the build system comparison tool and fixes CMake PCH propagation.
1. `scripts/compare_build_systems.py` only compared compilation flags, link targets, and linker settings — it did not compare IDL-generated file sets. This allowed PR #28843 to pass CI despite adding `strong_consistency/groups_manager.idl.hh` to `configure.py` but not to `idl/CMakeLists.txt`.
2. CMake's `scylla-main` target was not using the precompiled header (`stdafx.hh`), even though configure.py applies it to every source file via `-include-pch`. This caused compilation failures for files relying on transitive includes from the PCH — e.g., `sstables_loader.cc` failed with `no member named 'read_entire_stream' in namespace 'seastar::util'`.
Add a 4th comparison check to the build system comparison script: extract IDL-generated file sets from both build systems' ninja files and compare them. The extractors parse ninja build statements — configure.py side filters by build mode, CMake side handles the `|` separator for implicit outputs — and normalize to a canonical relative path for comparison.
Add the missing `strong_consistency/groups_manager.idl.hh` to `idl/CMakeLists.txt`.
Add `target_precompile_headers(scylla-main REUSE_FROM scylla-precompiled-header)` so that all sources compiled under `scylla-main` benefit from the PCH, matching configure.py's behavior.
Update documentation to reflect the new IDL comparison check.
Refs: https://github.com/scylladb/scylladb/pull/29901
Refs: https://github.com/scylladb/scylladb/pull/28843
No backport needed — these are build system improvements only.
Closes scylladb/scylladb#29912
* github.com:scylladb/scylladb:
cmake: reuse precompiled header in scylla-main target
idl: add missing groups_manager.idl.hh to CMakeLists.txt
scripts: add IDL-generated file comparison to compare_build_systems
The paxos state queries (load_paxos_state, save_paxos_promise, etc.)
were using page_size=-1 (no paging). While each query returns at most
one row and paging never actually kicks in, the lack of paging causes
these internal queries to be counted as non-paged reads in the metrics,
which can be confusing to users monitoring their cluster.
Add LIMIT 1 to the SELECT query so that may_need_paging() short-circuits
to false (row_limit <= 1), avoiding pager allocation overhead entirely.
Set page_size=1000 so these queries are no longer reported as non-paged
reads.
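The interaction between the row limit and the page size can be sketched with a toy model. This is only an illustration of the reasoning in this message, not the server code: the function bodies and `counted_as_non_paged()` are hypothetical stand-ins, and only the `row_limit <= 1` short-circuit and the `page_size=-1` / `page_size=1000` values come from the text above.

```python
def may_need_paging(row_limit: int, page_size: int) -> bool:
    """Toy model: decide whether a pager would have to be allocated."""
    if page_size <= 0:
        return False  # paging disabled by the caller
    if row_limit <= 1:
        return False  # LIMIT 1 short-circuit described in the message
    return True


def counted_as_non_paged(page_size: int) -> bool:
    """Toy model (assumption): the metric looks only at the requested page size."""
    return page_size <= 0


# Before the patch: page_size=-1 and no LIMIT (modelled here as a large row limit).
before = (may_need_paging(10_000, -1), counted_as_non_paged(-1))
# After the patch: LIMIT 1 and page_size=1000.
after = (may_need_paging(1, 1000), counted_as_non_paged(1000))
```

In this model neither variant allocates a pager, but only the old one is reported as a non-paged read.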
Refs: https://scylladb.atlassian.net/browse/CUSTOMER-372
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Backport: no, improvement
Closes scylladb/scylladb#29852
start_docker_service is a coroutine that took docker_args and
image_args by const reference. Its caller start_fake_gcs_server
is a regular function that passes temporaries (initializer lists)
and immediately returns a future. The temporaries are destroyed
when the caller returns, leaving the coroutine holding dangling
references.
On the first loop iteration this works by luck (memory not yet
reused), but on retry (after "address already in use") the
params.append_range(image_args) reads freed memory, causing
use-after-free that manifests as std::bad_alloc or broken_promise
in non-sanitizer builds.
Fix by taking docker_args and image_args by value so the coroutine
frame owns the vectors for its entire lifetime.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2003
Closes scylladb/scylladb#29932
DynamoDB normalizes Number values, so different string representations
of the same number (e.g., "1000" vs "1e3") should be treated as the
same value in all contexts.
In Alternator this is true in most cases, thanks to implicit normalization
in the Decimal `to_string()` function.
However, this is fragile, and in fact this function should be fixed
because of an OOM vulnerability in CQL use (#8002).
This patch adds tests that should prevent regression in cases
that work currently.
Unfortunately, not all contexts work currently: mainly, HASH keys
are not normalized and the backend handles them by byte representation.
The added tests reproduce this incorrect behaviour.
All added tests pass with DynamoDB, with one exception: weirdly
DynamoDB doesn't recognise unnormalized numbers in BatchGetItem
as duplicate keys.
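The difference between comparing Number values by value and by byte representation can be sketched with Python's `decimal` module. This is an illustration only: `canonical_number()` and `same_number()` are hypothetical helpers, not Alternator or DynamoDB code.

```python
from decimal import Decimal


def canonical_number(text: str) -> str:
    """Return one spelling shared by all representations of the same number."""
    value = Decimal(text)
    if value == 0:
        return "0"  # normalize() keeps the sign of zero, so fold -0 into 0
    # normalize() strips trailing zeros from the coefficient.
    return str(value.normalize())


def same_number(a: str, b: str) -> bool:
    """Compare two Number strings by numeric value, not by their bytes."""
    return Decimal(a) == Decimal(b)
```

Here `"1000"` and `"1e3"` differ as byte strings but `same_number("1000", "1e3")` holds, which is the behaviour a normalized key comparison would need.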
Ref SCYLLADB-1575
Closes scylladb/scylladb#29501
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.
Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py
Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
add_test_list(), build_test_list(), all_tests(), test_count(),
SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
Test attributes (args, core_args, valid_exit_codes, allure_dir,
is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
_prepare_pytest_params(), pattern, test_file_ext, xmlout,
server_log, scylla_env setup, and shlex import.
Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
architecture, log paths, and removed features.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Closes scylladb/scylladb#29613
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.
Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.
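A minimal sketch of such a retry loop, assuming a helper named `unmount_with_retry()` (hypothetical, not the name used in the test code). The `run` and `sleep` parameters are injection points added here only so the loop can be exercised without a real FUSE mount.

```python
import subprocess
import time


def unmount_with_retry(mountpoint, attempts=10, delay=0.5,
                       run=subprocess.run, sleep=time.sleep):
    """Run `fusermount3 -u`, retrying while the FUSE daemon is still busy.

    Returns the attempt number that succeeded; raises RuntimeError carrying
    the last captured stderr if every attempt fails.
    """
    stderr = ""
    for attempt in range(1, attempts + 1):
        result = run(["fusermount3", "-u", mountpoint],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        stderr = result.stderr.strip()  # kept for diagnostics
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"unmount of {mountpoint} failed after {attempts} attempts: {stderr}")
```

With the defaults, the worst case waits nine times 0.5s before giving up, since no sleep follows the final attempt.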
Fixes: SCYLLADB-2049
Closes scylladb/scylladb#29920
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory
Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.
This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements
Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645
Fixes SCYLLADB-1645
CI stability improvement. Backport to versions that have this test.
Closes scylladb/scylladb#29759
* github.com:scylladb/scylladb:
test: prepare max cells inserts
test: reuse max cells schema
test: limits: skip empty max cells iterations
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:
Can't provide accurate load computation with incomplete load_stats
for host: <uuid>
Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.
Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.
Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.
Closes scylladb/scylladb#29622
scylla-precompiled-header defines the PCH (stdafx.hh) with PRIVATE
visibility, so targets linking to it do not inherit the PCH.
scylla-main was missing the PCH entirely, causing files like
sstables_loader.cc to fail with 'no member read_entire_stream' since
that symbol comes from <seastar/util/short_streams.hh> which is
included in stdafx.hh.
PR #29901 worked around this by adding the missing #include directly,
but the real fix is to propagate the PCH to scylla-main — matching
the configure.py behavior where every source file is compiled with
-include-pch stdafx.hh.pch.
Add target_precompile_headers(scylla-main REUSE_FROM
scylla-precompiled-header) so that all sources in scylla-main benefit
from the precompiled header.
Refs: https://github.com/scylladb/scylladb/pull/29901
PR #28843 added strong_consistency/groups_manager.idl.hh to
configure.py but not to idl/CMakeLists.txt, causing the CMake build
to fail with a missing generated header.
Add a 4th check that compares IDL-generated file sets between
configure.py and CMake. Previously only compilation flags, link
targets, and linker settings were compared — a missing IDL entry
(like strong_consistency/groups_manager.idl.hh in PR #28843) would
go undetected.
The extractors parse ninja build statements from both systems and
normalize to a canonical relative path (e.g. cache_temperature.dist.hh)
for comparison. configure.py outputs are filtered by mode; CMake
outputs handle the | separator for implicit outputs in ninja build
lines.
Also update the documentation to mention the new check.
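The extraction and normalization steps can be sketched as follows. This is not the code in `scripts/compare_build_systems.py`: the function names, the `.dist.hh` / `.dist.impl.hh` suffix filter, and the assumption that generated files live under a directory named `idl/` are all illustrative guesses.

```python
IDL_SUFFIXES = (".dist.hh", ".dist.impl.hh")  # assumed generated-file suffixes


def ninja_outputs(ninja_text):
    """Yield every explicit and implicit output of each ninja build statement."""
    for line in ninja_text.splitlines():
        if not line.startswith("build "):
            continue
        # Outputs sit between "build " and the ": " that precedes the rule name.
        outputs, _, _ = line[len("build "):].partition(": ")
        # "|" separates explicit from implicit outputs; both kinds count.
        for path in outputs.replace("|", " ").split():
            yield path


def idl_files(ninja_text, mode=None):
    """Return the set of IDL-generated files as canonical relative paths."""
    found = set()
    for path in ninja_outputs(ninja_text):
        if not path.endswith(IDL_SUFFIXES):
            continue
        # Optional build-mode filter (assumes the mode is a path component).
        if mode is not None and f"/{mode}/" not in f"/{path}":
            continue
        # Canonical form: everything after the last "idl/" directory.
        found.add(path.rpartition("idl/")[2])
    return found
```

Comparing the two build systems then reduces to a set difference, for example `idl_files(configure_ninja, mode="dev") - idl_files(cmake_ninja)` for entries missing from CMake.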
Switch from raw CQL batch string to using a prepared statement.
The old approach constructed the entire 50-row batch as a single
CQL text string (~19.8 MiB with 32768 column names spelled out
per row). This caused large contiguous allocations in the server.
Fixes SCYLLADB-1645
Extract table creation into _create_max_cell_count_table(). Call
it once before the loop instead of creating and dropping the table
on every iteration. Use TRUNCATE instead of DROP TABLE between
iterations to clear data while keeping the schema.
This avoids repeated schema operations that fragment the Seastar
buddy allocator's address space with scattered small allocations.
Refs SCYLLADB-1645
Before this patch,
```
test/cqlpy/run test_vector_search_with_vector_store_mock.py
```
took 34 seconds.
After this patch, it takes **1 second**.
Look at the individual patches for how the magic happened. The first patch lowers the test duration from 34 to 5 seconds, the second patch lowers it further to 1 second.
Closes scylladb/scylladb#29891
* github.com:scylladb/scylladb:
test/cqlpy: make test_vector_search_with_vector_store_mock faster
vector-search: reset DNS timeout after changing host
The doubling loop in test_max_cells started from cells=1. Since
each row has MAX_CELLS_COLUMNS (32768) cells, iterations where
cells < MAX_CELLS_COLUMNS produced zero rows (cells // columns = 0).
Those iterations only did CREATE TABLE / DROP TABLE with no data
inserted.
Start the loop from MAX_CELLS_COLUMNS and use a while loop.
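A sketch of the reshaped loop, assuming a generator named `row_counts()` and an upper bound parameter `max_cells`. Both are hypothetical; only `MAX_CELLS_COLUMNS` and the `cells // columns` row calculation come from this message.

```python
MAX_CELLS_COLUMNS = 32768  # cells per row, as stated in the commit message


def row_counts(max_cells):
    """Yield (cells, rows) for each doubling step that inserts at least one row."""
    cells = MAX_CELLS_COLUMNS  # start where the first full row fits
    while cells <= max_cells:
        yield cells, cells // MAX_CELLS_COLUMNS
        cells *= 2


def empty_steps(iterations):
    """Count steps of the old loop (starting at cells=1) that produced zero rows."""
    return sum(1 for i in range(iterations)
               if (1 << i) // MAX_CELLS_COLUMNS == 0)
```

Under this model, an old loop doubling from cells=1 spends its first 15 steps (cells 1 through 16384) on empty CREATE TABLE / DROP TABLE cycles, which the new starting point skips.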
Co-authored-by: Dario Mirovic <dario.mirovic@scylladb.com>
Refs SCYLLADB-1645
The existing OCI section in admin.rst was a minimal stub that only showed
a config snippet without explaining how to actually set up connectivity.
Add documentation for:
- The OCI S3-compatible endpoint URL format (namespace + region)
- That credentials must be set explicitly via AWS_ACCESS_KEY_ID /
AWS_SECRET_ACCESS_KEY using OCI Customer Secret Keys (unlike AWS,
OCI has no instance metadata fallback compatible with STS/EC2)
- A note that iam_role_arn is AWS-specific and should be omitted for OCI
Fixes: SCYLLADB-501
Closes scylladb/scylladb#29689
Move `materialized_views_test.py::TestMaterializedViews::test_do_not_finish_view_building_with_hints`
test from dtest to test.py.
The dtest was throttling down IO throughput in the hope that the view
building won't be finished too soon. This introduces some unreliability,
which can be solved by using error injection and pausing view building
until we stop necessary nodes.
This patch adds 2 tests: one for tablet-based view and one for vnode-based. Both of the tests use error injection to pause view building.
Fixes [SCYLLADB-1261](https://scylladb.atlassian.net/browse/SCYLLADB-1261)
The issue was seen in 2026.2, so we should backport this patch to this version.
[SCYLLADB-1261]: https://scylladb.atlassian.net/browse/SCYLLADB-1261?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Closes scylladb/scylladb#29788
* github.com:scylladb/scylladb:
test/cluster/mv/test_mv_building: add similar test for vnode-based view
test/cluster/test_view_building_coordinator: migrate test from dtest
db/view/view_building_worker: add more logs when flushing base table
Commit c97232b introduced use of `seastar::util::read_entire_stream()`,
however it didn't include the relevant header, which causes a compilation
error.
It probably went silently through CI because of precompiled headers.
Refs scylladb#28763
Closes scylladb/scylladb#29901
Alternator Streams graduated from experimental in #29604. Update the
compatibility and FAQ docs accordingly:
- Replace the "Experimental API features" section with a new
"Alternator Streams" section that lists known differences without
the experimental framing.
- Expand the alternator_streams_increased_compatibility paragraph to
explain both consequences of leaving it off (spurious no-op events
and inaccurate INSERT/MODIFY distinction) and the performance cost
of enabling it (LWT path for every write).
- Drop the stale ShardFilter limitation (now implemented).
- Replace the alternator-streams FAQ example with
strongly-consistent-tables so the multi-feature syntax example
remains useful.
Fixes SCYLLADB-462
Closes scylladb/scylladb#29695
In the dtest repo, the test ran for both vnode-based and tablet-based views.
Since in test.py infra we're using error injection to pause the view
building process, we need separate tests for those two cases.
service_levels: self-heal stale v1 marker after raft topology upgrade
This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.
The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.
However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:
* raft topology is already enabled;
* service levels are still marked as v1 in `system.scylla_local`.
Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.
This PR makes the upgrade path self-healing.
The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.
The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.
With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.
Fixes: SCYLLADB-1807
Backport: 2026.2, 2026.1; this functionality is needed when upgrading older servers.
Closes scylladb/scylladb#29749
* github.com:scylladb/scylladb:
test/auth_cluster: simulate v1 state in self-heal test
qos: self-heal stale service levels version on startup
qos: reintroduce service levels v2 migration self-heal
Move `materialized_views_test.py::TestMaterializedViews::test_do_not_finish_view_building_with_hints`
test from dtest to test.py.
The dtest was throttling down IO throughput in the hope that the view
building won't be finished too soon. This introduces some unreliability,
which can be solved by using error injection and pausing view building
until we stop necessary nodes.
Fixes SCYLLADB-1261
Drop local formatter for seastar::http::reply, which should have
been added to Seastar in the first place, and now conflicts. Also
drop local formatters for types that are aliases for Seastar types
which have gained formatters.
Disable recently-gained TLS use of OpenSSL instead of gnutls. We
don't need it, and it causes link errors with LTO.
Fix incorrect skipping in encrypted_file_test, which computed
the remaining stream length but did not account for already
consumed size_to_compare.
Change utils::gcp::storage::client::object_data_source::skip()
to match new Seastar behavior (rejecting skip-past-eof with an
exception). This is needed since 30f1075544 switched the test's
data source to a Seastar implementation. It is also more correct -
if we're asked to skip n bytes but the stream doesn't have n bytes,
this is a protocol violation.
Contains test fix from Pavel, exposed by [1]:
test: Handle premature EOF in test_gcp_storage_skip_read
The test intentionally uses file_size larger than the actual object to
exercise EOF behavior. When input_stream::skip() is called after EOF,
it throws std::runtime_error("premature end of stream"). Catch this
specific exception from both streams, verify they agree, and exit the
loop gracefully.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[1] cbd1e17d2f, included in this Seastar submodule update
* seastar 4d268e0e...485a62b2 (50):
> reactor: open_directory(): honor bypass_fsync
> http: Add formatters for http::request and http::reply
> Merge 'Assorted set of io-tester cleanups' from Pavel Emelyanov
io_tester: Remove unused and internal-only accessor
io_tester: Move think-time machinery into thinker_state
io_tester: Move _file to io_class_data
io_tester: Replace class_data::_start member with a local variable
io_tester: Move _alignment from class_data to io_class_data
io_tester: Remove buffer allocation from top-level request issuing
io_tester: Cleanup context::stop() invocation
io_tester: Allocate write buffer once to fill a file
io_tester: Declare quantiles arrays as static constexpr
io_tester: Drop class_data::type_str()
io_tester: Replace != "" comparisons with .empty()
io_tester: Replace gen_class_data() if/else chain with a switch
io_tester: Deduplicate vectorized I/O classes
> io_tester: fix crash from missing metric during startup
> net: tls: adjust openssl integration to new module support
> http/client: Count and export integrated queue length
> Merge 'Introduce pipe_data_source_impl and pipe_data_sink_impl' from Pavel Emelyanov
fstream: add pipe_data_source_impl and pipe_data_sink_impl
pollable_fd: add write_some/write_all backed by writev
pollable_fd: rename write_some/write_all(iovec) to send_some/send_all
> reactor: Make pollable_fd_state helper methods private
> module: extend seastar.cppm with comprehensive public API exports
> Merge 'Add exhaustive input_stream invariant test + fixes' from Pavel Emelyanov
tests: add exhaustive input_stream read/skip invariant test
iostream: make skip() reject premature end of stream with exception
> Merge 'Allow runtime selectability of GnuTLS or OpenSSL' from Noah Watkins
net/tls: avoid potential read-past-buffer
net/tls: move credential methods to generic tls layer
net/tls: rename credentials_impl::dh_params to set_dh_params
test/tls: enable openssl tls unit test
test/tls: fix CA cert generation to use v3_ca extensions
github: disable parallel test execution in alpine workflow
crypto: support compiling seastar without gnutls
net/tcp: use crypto provider for md5 calculation
tls: fix test_peer_certificate_chain_handling for OpenSSL
net/tls: fix test for self-signed server cert opoenssl compat
net/tls: disable priority strings test for openssl provider
core/crypto: expose crypto backend name for introspection
test/tls: remove gnutls version guard
net/tls: add openssl tls backend
http: use backend agnostic tls error code
net/tls: make error codes configurable by each tls backend
net/tls: move reloadable_credentials to generic tls layer
net/tls: move build_certificate to generic tls layer
net/tls: move apply_to() to generic tls layer
net/tls: move credential methods to generic tls layer
net/tls: add OpenSSL-specific methods to public API with no-op defaults
net/tls: introduce dh_params and credentials abstraction layer
net/tls: add credentials_impl abstract base class
net/tls: dispatch tls::error_category() through crypto_provider
net/tls: dispatch wrap_client/wrap_server through crypto_provider
net/tls: add tls_backend interface to crypto_provider
net/tls: move public tls API methods to generic tls layer
net/tls: move formatting utilities to generic tls layer
net/tls: move credentials_builder blob methods to generic tls layer
net/tls: move dh_params::from_file to generic tls layer
net/tls: move abstract_credentials file methods to generic tls layer
net/tls: move tls_socket_impl to generic tls layer
net/tls: move server_session to general tls layer
net/tls: move tls_connected_socket_impl to generic tls layer
net/tls: move net::get_impl to generic tls layer
net/tls: move session_ref to generic tls layer
net/tls: add session_impl abstract interface for tls pluggability
net/tls: rename tls.cc to be gnutls specific
crypto: introduce crypto provider abstraction
http: remove unused include
> tls: test_send_two_large
> rpc: include exception type for remote errors
> GHA: increase timeout to 60 minutes
> apps/httpd: replace deprecated reply::done() with write_body()
> missing header(s)
> net: Fix missing throw for runtime_error in create_native_net_device
> tests/io_queue: account for token bucket refill granularity in bandwidth checks
> Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
tests: add regression tests for zero-length iovec handling
iovec: fix iovec_trim_front infinite loop on zero-length iovecs
> util/process: graduate process management API from experimental
> cooking: don't register ready.txt as a build output
> sstring: make make_sstring not static
> Add SparkyLinux to debian list in install-dependencies.sh
> http: allow control over default response headers
> Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
tests/perf: add chunked_fifo microbenchmarks
chunked_fifo: set the default free chunk retention to 0
chunked_fifo: make free chunk retention configurable
> Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
tests: add regression test for pollable_fd_state_completion reuse
reactor_backend: use reset() in AIO and epoll poll paths
reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
> Merge 'coroutine: Generator cleanups' from Kefu Chai
coroutine/generator: extract schedule_or_resume helper
coroutine/generator: remove unused next_awaiter classes
coroutine/generator: remove write-only _started field
coroutine/generator: assert on unreachable path in buffered await_resume
coroutine/generator: add elements_of tag and #include <ranges>
coroutine/generator: add empty() to bounded_container concept
> cmake: bump minimum Boost version to 1.79.0
> seastar_test: remove unnecessary headers
> cmake: bump minimum GnuTLS version to 3.7.4
> Merge 'reactor: add get_all_io_queues() method' from Travis Downs
tests: add unit test for reactor::get_all_io_queues()
reactor: add get_all_io_queues() method
reactor: move get_io_queue and try_get_io_queue to .cc file
> http: deprecate reply::done(), remove _response_line dead field
> core: Deprecate scattered_message
> ci: add workflow dispatch to tests workflow
> perf_tests: exit non-zero when -t pattern matches no tests
> Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
> perf_tests: add total runtime to json output
> Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
implement move assignment operator for json_list_template
json_list_template copy assignment operator reserves capacity upfront
> perf_tests: add --no-perf-counters option
> Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
memory: Add compile-time test for value-to-human-readable conversion
memory: Extend list of suffixes to have peta-s
memory: Fix off-by-one in suffix calculation
memory: Mark to_human_readable_value() and others constexpr
> http: Improve writing of response_line() into the output
> Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
websocket: add template parameter for text/binary frame mode
websocket: impl client side websocket function
> file: Fix checks for file being read-only
> reactor: Make do_dump_task_queue a task_queue method
> Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
tests/output_stream: sample type patterns in sanitizer builds
tests/output_stream: extend invariant test to cover mixed write modes
iostream: allow unrestricted mixing of buffered and zero-copy writes
tests/output_stream: remove obsolete ad-hoc splitting tests
tests/output_stream: add invariant-based splitting tests
iostream: rename output_stream::_size to ::_buffer_size
> reactor_backend: replace virtual bool methods with const bool_class members
> resource: Avoid copying CPU vector to break it into groups
> perf_tests: increase overhead column precision to 3 decimal places
> Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
reactor: Deprecate fdatasync() method
file: Do fdatasync() right in the posix_file_impl::flush()
file: Propagate aio_fdatasync to posix_file_impl
reactor: Move reactor::fdatasync() code to file.cc
reactor,file: Make full use of file_open_options::durable bit
file: Add file_open_options::durable boolean
file: Account io_stats::fsyncs in posix_file_impl::flush()
reactor: Move _fsyncs counter onto io_stats
> http: Remove connection::write_body()
Closes scylladb/scylladb#29553
Recently (in commit 37fc1507f0) we added vector search support for Alternator.
That implementation was functional, but did not yet support all the features that we had envisioned.
This patch series adds some of the missing features to Alternator's vector search. Each feature is described in more detail in its own patch.
* Metrics related to vector search usage in Alternator.
* `SimilarityFunction` option when creating a vector index to choose the similarity function. Defaults to `COSINE` (the existing default). Other options are `DOT_PRODUCT` and `EUCLIDEAN`.
* An optimized vector type, `{"FLOAT32VECTOR": [1.0, 2.0, ..]}`, which is stored on disk efficiently as 32-bit floats, **not** a JSON.
* A Query VectorSearch option `ReturnScores` asking to return the similarity score calculated for each returned result (the results are sorted in decreasing similarity score - the highest similarity is the best and returned first).
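To illustrate why packing a vector as 32-bit floats is more compact than keeping it as JSON text, here is a small sketch using Python's `struct` module. The little-endian layout and the helper names are assumptions made for this example; the series description does not specify the actual on-disk encoding.

```python
import json
import struct


def pack_float32(values):
    """Pack a list of numbers as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_float32(blob):
    """Inverse of pack_float32(): 4 bytes per element."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


vector = [1.0, 2.0, 3.0]
packed_size = len(pack_float32(vector))           # 4 bytes per element
json_size = len(json.dumps(vector).encode())      # size as JSON text
```

The packed form always costs four bytes per element, while the JSON size grows with the number of digits written for each value.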
Closes scylladb/scylladb#29554
* github.com:scylladb/scylladb:
alternator: add ReturnScores option to VectorSearch
vector_store_client: read and return similarity_scores
alternator: add optimized vector type for vector search
alternator: add SimilarityFunction option to vector index creation
alternator: add vector search metrics
This series fixes a recurring source of flaky tests in the cluster test suite.
When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open — causing intermittent failures.
The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port.
This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it:
* Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING.
* Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee.
* Remove now-redundant expected_server_up_state=SERVING from existing tests.
* A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly.
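The check_serving_notification() fix in the last item can be sketched as below. The class, its constructor, and the use of `concurrent.futures.Future` are hypothetical stand-ins for the test library's real objects; only the `_received_serving` attribute and the "return it directly" behaviour come from the description.

```python
from concurrent.futures import Future


class ServingWatcher:
    """Toy stand-in for the object that tracks the sd_notify status."""

    def __init__(self, notify_future: Future):
        self._notify_future = notify_future
        self._received_serving = False

    def check_serving_notification(self) -> bool:
        if not self._received_serving and self._notify_future.done():
            # Record the notification as soon as the future completes.
            self._received_serving = (
                self._notify_future.result() == "STATUS=serving")
        # Return the updated flag directly, so the caller sees True on the
        # same poll in which the future completed instead of one poll later.
        return self._received_serving
```

A poller that calls this every 100ms therefore observes readiness on the first call after the notification arrives.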
Closes scylladb/scylladb#29758
* github.com:scylladb/scylladb:
test/pylib: fix missing protocol_version=4 on control_cluster
scylla_cluster: guard poll_status() set_result() calls against cancelled future
test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
test/cluster: fix check_serving_notification() inefficiency
test/cluster: remove now-redundant expected_server_up_state=SERVING
test/cluster: document that add/start waits for all ports to be ready
test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
test/cluster: fix server_add/server_start hanging when starting in maintenance mode
main: notify "entering maintenance mode" after the maintenance CQL server is ready
test/cluster: make server_start() default to ServerUpState.SERVING
test/cluster: make server_add() default to ServerUpState.SERVING
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.
In that stale state, migrate service levels to v2 through group0 instead
of failing startup.
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.
The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.
Fixes SCYLLADB-807
no backport - strong consistency is still experimental
Closes scylladb/scylladb#28843
* github.com:scylladb/scylladb:
strong_consistency: wait for leader when starting a group
strong_consistency: change wait for groups to start on startup
strong_consistency: optimize wait_for_groups_to_start
strong_consistency: wait for raft servers to start in create table
Strongly consistent requests take a different path than EC requests, and consistency levels don't map well.
We should limit the consistency levels available in SC requests to avoid silently ignoring them, which may confuse users.
For writes, there is only one option:
- QUORUM/LOCAL_QUORUM (multi DC is not supported yet, so both of those CLs have the same effect) - we need quorum of replicas to successfully commit new mutations to Raft log.
For reads, there are 2 options:
- QUORUM/LOCAL_QUORUM - if user wants to be sure he sees latest data and the query needs to execute `read_barrier()`, which requires quorum of replicas
- ONE/LOCAL_ONE - if user just wants to read data from one replica without synchronization
All tests were updated to use LOCAL_QUORUM for both read and writes.
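The rules above can be sketched as a small validation step. This is only an illustration in Python, not the code from this series: the names `validate_sc_consistency` and `read_needs_barrier` and the error text are hypothetical.

```python
# Hypothetical sketch of the consistency-level rules for strongly
# consistent (SC) requests. Names and messages are illustrative only.
WRITE_CLS = frozenset({"QUORUM", "LOCAL_QUORUM"})
READ_CLS = WRITE_CLS | {"ONE", "LOCAL_ONE"}


def validate_sc_consistency(cl: str, is_write: bool) -> None:
    """Reject a consistency level that SC requests do not support."""
    allowed = WRITE_CLS if is_write else READ_CLS
    if cl not in allowed:
        kind = "write" if is_write else "read"
        raise ValueError(
            f"consistency level {cl} is not supported for strongly "
            f"consistent {kind}s; allowed: {', '.join(sorted(allowed))}"
        )


def read_needs_barrier(cl: str) -> bool:
    """A QUORUM/LOCAL_QUORUM read synchronizes via a read barrier;
    a ONE/LOCAL_ONE read is served by one replica without it."""
    validate_sc_consistency(cl, is_write=False)
    return cl in WRITE_CLS
```

Rejecting an unsupported level up front, rather than ignoring it, is the behavior the description argues for.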
Fixes SCYLLADB-1766
SC is in experimental phase and this patch is an improvement, no backport needed.
Closes scylladb/scylladb#29691
* github.com:scylladb/scylladb:
strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
`system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.
This PR removes all `make_..._mutation()` system keyspace functions related to view building tasks and replaces them with mutation builder.
Refs https://github.com/scylladb/scylladb/issues/25929
This patch doesn't fix any bug, it only reduces number of generated mutations, no need to backport it.
Closes scylladb/scylladb#26557
* github.com:scylladb/scylladb:
db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder
db/view/view_building_task_mutation_builder: make uuid generator optional
db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder
db/view/view_building_task_mutation_builder: add helper method
The previous patch made test_vector_search_with_vector_store_mode
significantly faster, but at 5 seconds for 7 tests, it was still not
fast enough.
It turns out that the reason why the tests were slow is that each test
used a function-scoped fixture, which set up the vector store mock
again and again, separately for each test. This - especially waiting
for the client in Scylla to recognize the new server - took time
(before the previous patch it was 5 seconds, after the patch it
went down to 0.5 seconds - but still too slow).
The solution is simple:
1. Create a *module* scoped fixture that creates the mock and connects
it to Scylla just once for all the tests in that file.
2. The *function* scoped fixture just uses the module-scoped one but
resets the saved responses, to avoid one test influencing the other.
After this patch, the time to run this test file is down to 1 second (!).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The vector-search client in ScyllaDB limits itself to doing one DNS
lookup per 5 seconds. However, when the configuration changes to point
to a different host, the DNS lookup should happen immediately, and
this patch makes it do that.
Before this patch,
test/cqlpy/run test_vector_search_with_vector_store_mock.py
Takes a whopping 34 seconds, more than 4 seconds per test!
The problem is that each test creates a new mock vector-store server
and reconfigures Scylla, and when reconfiguring Scylla nothing happens
until the 5-second clock runs out.
After this patch, the same test run is down to 5 seconds.
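The idea can be sketched in Python as follows. This is a simplified model, not the actual client code: the class `RateLimitedResolver`, its methods and the injected `resolve` and `clock` callables are all hypothetical.

```python
import time


class RateLimitedResolver:
    """Hypothetical model: at most one lookup per min_interval seconds,
    except that a host change forces the next lookup to run immediately."""

    def __init__(self, resolve, min_interval=5.0, clock=time.monotonic):
        self._resolve = resolve          # callable: host -> address
        self._min_interval = min_interval
        self._clock = clock              # injectable for testing
        self._host = None
        self._address = None
        self._last_lookup = None         # None means "no valid lookup yet"

    def set_host(self, host):
        """Called on reconfiguration; a new host invalidates the cache."""
        if host != self._host:
            self._host = host
            self._address = None
            self._last_lookup = None     # do not wait for the interval

    def address(self):
        now = self._clock()
        if (self._last_lookup is None
                or now - self._last_lookup >= self._min_interval):
            self._address = self._resolve(self._host)
            self._last_lookup = now
        return self._address
```

Without the reset in `set_host()`, a reconfiguration would keep returning the old address until the interval expired, which is the delay the tests were paying for.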
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A vector search operation in Alternator (VectorSearch option to Query)
returns items sorted by decreasing similarity to the searched vector.
Although the items are sorted by decreasing similarity scores, before
this patch the user had no way to see the values of these scores.
This patch adds a new VectorSearch option, `ReturnScores`. This option
defaults to `NONE`. But if set to `SIMILARITY`, the query will return
an array `Scores` with the same length as `Items`, which gives the
similarity score for each item.
As usual, this patch includes the implementation, the documentation,
and tests for the new feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The vector store returns for every ANN search, in addition to the keys
of the matching items, two additional vectors - "distances" and
"similarity_cores". The "distances" are raw distance metrics - lower
scores are better matches, while "similarity_scores" are modified
such that higher scores are better matches.
Traditionally, search scores in systems like Cassandra and OpenSearch
use the "similarity scores" approach (higher is better, results are
returned in decreasing similarity order), so this is the more interesting
vector of the two.
But before this patch, our vector_store_client::ann() inspected
only "distances". But... then, it didn't return even that to the
caller :-)
So in this patch, we:
1. Ignore "distances" and instead look at "similarity scores",
which is what users really want based on their experience with
other vector and non-vector search engines.
2. Return the similarity score of each match together with the match.
We already have this score (the vector store returns it) and we
can add it to the existing primary_key structure of each result.
So each result is a "struct primary_key" which has fields partition,
clustering, and after this patch - similarity.
Existing callers in CQL and Alternator vector search will ignore this
"similarity" field in each result, and not notice it was added.
But in the next patch, we'll allow Alternator's vector search to
return this similarity in each result.
The existing unit tests for vector_store_client.cc mocked vector-store
responses with "distances", without "similarity_scores", so no longer
represent what we actually expect the vector store to do. So this patch
also contains modifications for these tests, to mock and to test
"similarity_scores" - not "distances". The more interesting tests, in
the next patch, use the real vector store and check that we really do
get a "similarity_scores" response from it.
This patch also handles a small corner case for DOT_PRODUCT, which is
the only unbounded similarity function. If the similarity overflows
the 32-bit float, the vector store returns a JSON "null" instead of
a JSON number (since JSON doesn't support infinite numbers). Our
existing vector-store client code errored out when it saw this "null",
which is wrong - the request should be allowed to proceed. So in this
patch when we see a "null" JSON for similarity, we return +Inf.
This is usually correct because the top results really have +Inf, not
-Inf, but if we ask for all items we can reach those with similarity
-Inf and incorrectly assign +Inf to them (we have a test for this case
in the next patch). But this problem won't happen when Limit is low,
and in any case it's better than aborting the request after it had
already succeeded.
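The "null" handling can be sketched in Python as below. The "similarity_scores" field name comes from the description above; the function name and the surrounding JSON object shape are assumptions made for illustration, and the real client is C++.

```python
import json
import math
from typing import List


def parse_similarity_scores(body: str) -> List[float]:
    """Hypothetical sketch: read "similarity_scores" from a response body,
    mapping a JSON null (an overflowed score) to +Inf instead of failing."""
    doc = json.loads(body)
    return [math.inf if s is None else float(s)
            for s in doc["similarity_scores"]]
```

As noted above, mapping every null to +Inf is wrong for scores that really overflowed to -Inf; the sketch shares that limitation.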
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing
The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).
The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.
Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.
The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
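The arithmetic behind the two scenarios can be checked with a short Python sketch. The formula (max - min) / max is an assumption: it is the one that reproduces both the 0.78% and the 33% figures above. The function names are hypothetical and do not mirror the C++ `is_balanced()` signature.

```python
def relative_load_diff(max_load: float, min_load: float) -> float:
    """Relative difference between the most- and least-loaded shard.
    Assumed formula: (max - min) / max."""
    if max_load <= 0:
        return 0.0
    return (max_load - min_load) / max_load


def is_balanced(max_load: float, min_load: float,
                threshold: float = 0.01) -> bool:
    """True when the difference is below the threshold (1% here, as in the
    test); the balancer would then stop issuing intranode migrations."""
    return relative_load_diff(max_load, min_load) < threshold
```

With equally sized tablets, 257 vs 255 gives 2/257, which is under 1%, while 307 vs 205 gives 102/307, which is far above it.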
Fixes: SCYLLADB-1775
This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2
Closes scylladb/scylladb#29756
* github.com:scylladb/scylladb:
test: add test for intranode balance threshold in size-based mode
tablet_allocator: apply balance threshold to intranode shard balancing
Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes running hinted handoff and materialized view startup work can time out while applying Raft entries before the test starts.
Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior.
Closes scylladb/scylladb#29829
The test/cqlpy/run-cassandra script makes it quite easy to run test/cqlpy
tests against Cassandra, which is important for checking compatibility.
Unfortunately, because modern Linux distributions like Fedora do not have
either Cassandra or the old version of Java that it needs, the user needs
to download those manually. This is fairly easy, and explained in detail
in test/cqlpy/README.md, but nevertheless is a non-trivial manual step.
So this patch adds an even simpler alternative, the "--docker" option
which tells the script to run the official Cassandra docker image,
complete with the version of Java that it prefers - the user does not
need to download or install Cassandra or Java. The image is efficiently
cached by Docker, so running run-cassandra again doesn't need to
download it again. Moreover, trying several different versions of
Cassandra only needs to download and store the shared parts (base image
and Java) once.
test/cqlpy/run-cassandra --docker test_file.py::test_function
By default this runs the latest Cassandra 5 release. You can also use
"--docker=4" to get the latest Cassandra 4 release, "--docker=3.11"
to get the latest Cassandra 3.11 patch release, or "--docker=3.11.1"
to get a specific patch release.
In addition to the "--docker" option, this patch also introduces a
second option, "--java-docker", which takes *only* Java from docker,
but runs your locally installed Cassandra (to which you should point
with the CASSANDRA environment variable, as before). This option can
be useful if your host does not have a suitable version of Java, but
you want to run a locally-installed or locally-modified version of
Cassandra. The "--java-docker" option defaults to getting Java 11;
to use another version, pass it explicitly, for example "--java-docker=17".
Fixes #25826.
Closes scylladb/scylladb#29860
Today in Alternator vector search, vectors are presented to the API as
lists of numbers. I.e., in JSON a vector is sent in requests and responses
as:
{"L": [{"N": "3.14159"}, {"N": "6.7"}]}
This format is verbose and inefficient for long vectors. Even worse,
because the "N" number format has precision guarantees in DynamoDB,
we cannot optimize the storage of such vectors by, for example, storing
the numbers as 32-bit floats. We actually store these vectors as JSON,
exactly as shown above.
So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for
vectors. The above vector will look like this in JSON:
{"FLOAT32VECTOR": [3.14159, 6.7]}
Note that each number is an unquoted JSON number, not a JSON string.
Importantly, the definition of the "FLOAT32VECTOR" type specifies that
components of the vector only have 32-bit precision. This means that
Scylla may store these vectors internally as lists of 32-bit floats -
not as JSON. And indeed, this patch includes this optimization:
Top-level vector attributes are now encoded in an optimized way,
as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements
of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754
representation of each floating-point component).
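The byte layout described above can be sketched in Python with the struct module. This only illustrates the layout (type byte 5, then 4-byte big-endian floats); the function names are hypothetical and the real serialization code is C++.

```python
import struct

FLOAT32VECTOR_TAG = 5  # alternator_type::FLOAT32VECTOR, per the text above


def encode_float32_vector(values):
    """Type byte followed by one 4-byte big-endian IEEE 754 float
    per vector component."""
    return bytes([FLOAT32VECTOR_TAG]) + struct.pack(f">{len(values)}f", *values)


def decode_float32_vector(blob):
    """Inverse of encode_float32_vector(); rejects malformed input."""
    if not blob or blob[0] != FLOAT32VECTOR_TAG:
        raise ValueError("not a FLOAT32VECTOR value")
    payload = blob[1:]
    if len(payload) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4")
    return list(struct.unpack(f">{len(payload) // 4}f", payload))
```

The two-component vector from the example above takes 9 bytes in this layout. Decoding returns the values rounded to 32-bit precision, so 3.14159 comes back only approximately equal.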
This patch also includes documentation, and extensive tests that the
new "FLOAT32VECTOR" type works (which also serve as an example of how to
use it in the boto3 SDK), that it is indeed encoded internally as 32-bit
floats and not wasteful JSON strings, and that vector search on such items
works. The last thing requires cooperation from the vector store, of
course - it needs to be able to understand the new optimized encoding
of vector attributes in addition to the old unoptimized one.
Note that the old unoptimized ("list of numbers") vectors are still
supported. Although not recommended for general use, some users might
still want to use the unoptimized type if they have pre-existing data
created on DynamoDB or Alternator without vector search in mind, and
the vectors already exist as lists of numbers.
Although this is less important, the new vector type "FLOAT32VECTOR"
is also allowed in a Query's QueryVector.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, vector search always used the COSINE similarity
function. In this patch we add the ability to choose a different
similarity function when creating a new vector index (with CreateTable
or UpdateTable) by using the SimilarityFunction option. We still default
to "COSINE" if SimilarityFunction isn't specified.
Allowed similarity functions are COSINE, DOT_PRODUCT, and EUCLIDEAN.
DescribeTable can also retrieve a vector index's SimilarityFunction.
As usual, this patch also includes documentation for the new feature,
and tests. Some of the tests can run without a vector store - verifying
the API syntax and which similarity function is supported - but we
also add tests that require the vector store and check that the different
similarity functions actually sort the nearest items in the expected
order.
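To illustrate why the choice of similarity function matters, here is a Python sketch that ranks the same items under the three functions. It uses the textbook definitions; the exact score values the vector store reports are not specified here, and the `rank` helper is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, items, function="COSINE"):
    """Return item names ordered from best to worst match."""
    if function == "COSINE":
        score = lambda v: cosine(query, v)
    elif function == "DOT_PRODUCT":
        score = lambda v: dot(query, v)
    elif function == "EUCLIDEAN":
        # Smaller distance is a better match, so negate it for sorting.
        score = lambda v: -euclidean_distance(query, v)
    else:
        raise ValueError(f"unknown similarity function: {function}")
    return sorted(items, key=lambda name: score(items[name]), reverse=True)


items = {"a": (10.0, 10.0), "b": (1.0, 0.1), "c": (2.0, 0.0)}
query = (1.0, 0.0)
```

For this data each function picks a different best match: COSINE prefers "c" (same direction as the query), DOT_PRODUCT prefers "a" (largest magnitude), and EUCLIDEAN prefers "b" (closest point).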
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, we did not have any special metrics for vector search
in Alternator. We had a count of "Query" operations, but there was no
distinction between "standard" queries - of a base table or GSI/LSI -
and vector-search queries.
This patch adds four new metrics:
* vector_search_query - counting how many Query requests are actually
vector searches.
* vector_search_query_returned_items - counting how many items were
returned by vector searches.
* vector_search_query_items_from_vs - counting how many results were
retrieved from the vector-store backend.
* vector_search_query_items_from_base_table - counting how many items
were read from the base table during vector-search queries. Some
vector search queries using SELECT=ALL_PROJECTED_ATTRIBUTES or COUNT
are optimized to not need to read items from the base table.
This patch also includes documentation for the four new metrics, and
tests that they count what we want them to count.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function does nothing useful now.
No backport needed. Removes code.
Closes scylladb/scylladb#29828
* https://github.com/scylladb/scylladb:
raft_group0: remove finish_setup_after_join function
raft_group0: fix indentation after the last change
raft_group: drop unneeded checks
After scylladb/scylladb#28929 `task_uuid_generator` became a necessary
dependency of `view_building_task_mutation_builder`.
However to create the generator we need `view_building_state`, which in
some parts of the code (schema_tables.cc, migration_manager.cc) requires
remote proxy to be obtained.
But sometimes we need the mutation builder to just remove some view
building task. In those cases, we don't need the uuid generator and the
remote proxy requirement is not necessary.
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.
However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.
When starting the raft server for a group, wait for the leader before
completing the start operation. We want the group to be ready to accept
writes by the time the start is reported to be completed without the
additional latency of waiting for leader.
On startup, groups_manager::start() was previously called and waited for
the groups to start. We change it to just start the raft servers
in the background without waiting for them to be fully started. We wait
for the servers to start explicitly at a later stage of startup, after
starting the messaging service.
The reason is that for the servers to be fully started they may require
communication that requires the messaging service. Currently it is not
required, but it will be changed in the next commit.