11770 Commits

Author SHA1 Message Date
Piotr Dulikowski
0c016cecc3 Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.
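
The retry-and-update flow described above can be sketched as follows (a minimal Python sketch; the names `migrate_to_v2`, `MAX_RETRIES`, and the function itself are illustrative, not Scylla's actual C++ API):

```python
# Hypothetical sketch of the bounded-retry self-heal described above.
MAX_RETRIES = 3

def self_heal_service_levels(raft_topology_enabled, version_marker, migrate_to_v2):
    """If raft topology is on but service levels are still marked v1,
    retry the v2 migration a bounded number of times before giving up."""
    if not (raft_topology_enabled and version_marker == "v1"):
        return version_marker          # nothing to heal
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            migrate_to_v2()            # group0-based write of the v2 state
            return "v2"                # update the marker instead of failing startup
        except RuntimeError as e:
            last_error = e             # transient failure: retry
    raise last_error                   # retries exhausted: surface the error
```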

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.
Fixes: SCYLLADB-1807

backport: 2026.2, 2026.1; we need this functionality when upgrading older servers

Closes scylladb/scylladb#29749

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-14 10:32:43 +02:00
Avi Kivity
6db152afbb Update seastar submodule
Drop local formatter for seastar::http::reply, which should have
been added to Seastar in the first place, and now conflicts. Also
drop local formatters for types that are aliases for Seastar types
which have gained formatters.

Disable recently-gained TLS use of OpenSSL instead of gnutls. We
don't need it, and it causes link errors with LTO.

Fix incorrect skipping in encrypted_file_test, which computed
the remaining stream length but did not account for already
consumed size_to_compare.

Change utils::gcp::storage::client::object_data_source::skip()
to match new Seastar behavior (rejecting skip-past-eof with an
exception). This is needed since 30f1075544 switched the test's
data source to a Seastar implementation. It is also more correct -
if we're asked to skip n bytes but the stream doesn't have n bytes,
this is a protocol violation.
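
A Python analogy of the new skip() contract (names are illustrative; Seastar's actual `input_stream` is C++): skipping past end-of-stream is a protocol violation and raises instead of silently succeeding.

```python
def skip(buf: bytes, pos: int, n: int) -> int:
    """Advance pos by n bytes, rejecting a skip past end of stream."""
    if pos + n > len(buf):
        # Mirrors input_stream::skip() throwing on skip-past-eof.
        raise RuntimeError("premature end of stream")
    return pos + n
```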

Contains test fix from Pavel, exposed by [1]:

test: Handle premature EOF in test_gcp_storage_skip_read

The test intentionally uses file_size larger than the actual object to
exercise EOF behavior. When input_stream::skip() is called after EOF,
it throws std::runtime_error("premature end of stream"). Catch this
specific exception from both streams, verify they agree, and exit the
loop gracefully.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

[1] cbd1e17d2f, included in this Seastar submodule update

* seastar 4d268e0e...485a62b2 (50):
  > reactor: open_directory(): honor bypass_fsync
  > http: Add formatters for http::request and http::reply
  > Merge 'Assorted set of io-tester cleanups' from Pavel Emelyanov
    io_tester: Remove unused and internal-only accessor
    io_tester: Move think-time machinery into thinker_state
    io_tester: Move _file to io_class_data
    io_tester: Replace class_data::_start member with a local variable
    io_tester: Move _alignment from class_data to io_class_data
    io_tester: Remove buffer allocation from top-level request issuing
    io_tester: Cleanup context::stop() invocation
    io_tester: Allocate write buffer once to fill a file
    io_tester: Declare quantiles arrays as static constexpr
    io_tester: Drop class_data::type_str()
    io_tester: Replace != "" comparisons with .empty()
    io_tester: Replace gen_class_data() if/else chain with a switch
    io_tester: Deduplicate vectorized I/O classes
  > io_tester: fix crash from missing metric during startup
  > net: tls: adjust openssl integration to new module support
  > http/client: Count and export integrated queue length
  > Merge 'Introduce pipe_data_source_impl and pipe_data_sink_impl' from Pavel Emelyanov
    fstream: add pipe_data_source_impl and pipe_data_sink_impl
    pollable_fd: add write_some/write_all backed by writev
    pollable_fd: rename write_some/write_all(iovec) to send_some/send_all
  > reactor: Make pollable_fd_state helper methods private
  > module: extend seastar.cppm with comprehensive public API exports
  > Merge 'Add exhaustive input_stream invariant test + fixes' from Pavel Emelyanov
    tests: add exhaustive input_stream read/skip invariant test
    iostream: make skip() reject premature end of stream with exception
  > Merge 'Allow runtime selectability of GnuTLS or OpenSSL' from Noah Watkins
    net/tls: avoid potential read-past-buffer
    net/tls: move credential methods to generic tls layer
    net/tls: rename credentials_impl::dh_params to set_dh_params
    test/tls: enable openssl tls unit test
    test/tls: fix CA cert generation to use v3_ca extensions
    github: disable parallel test execution in alpine workflow
    crypto: support compiling seastar without gnutls
    net/tcp: use crypto provider for md5 calculation
    tls: fix test_peer_certificate_chain_handling for OpenSSL
    net/tls: fix test for self-signed server cert openssl compat
    net/tls: disable priority strings test for openssl provider
    core/crypto: expose crypto backend name for introspection
    test/tls: remove gnutls version guard
    net/tls: add openssl tls backend
    http: use backend agnostic tls error code
    net/tls: make error codes configurable by each tls backend
    net/tls: move reloadable_credentials to generic tls layer
    net/tls: move build_certificate to generic tls layer
    net/tls: move apply_to() to generic tls layer
    net/tls: move credential methods to generic tls layer
    net/tls: add OpenSSL-specific methods to public API with no-op defaults
    net/tls: introduce dh_params and credentials abstraction layer
    net/tls: add credentials_impl abstract base class
    net/tls: dispatch tls::error_category() through crypto_provider
    net/tls: dispatch wrap_client/wrap_server through crypto_provider
    net/tls: add tls_backend interface to crypto_provider
    net/tls: move public tls API methods to generic tls layer
    net/tls: move formatting utilities to generic tls layer
    net/tls: move credentials_builder blob methods to generic tls layer
    net/tls: move dh_params::from_file to generic tls layer
    net/tls: move abstract_credentials file methods to generic tls layer
    net/tls: move tls_socket_impl to generic tls layer
    net/tls: move server_session to general tls layer
    net/tls: move tls_connected_socket_impl to generic tls layer
    net/tls: move net::get_impl to generic tls layer
    net/tls: move session_ref to generic tls layer
    net/tls: add session_impl abstract interface for tls pluggability
    net/tls: rename tls.cc to be gnutls specific
    crypto: introduce crypto provider abstraction
    http: remove unused include
  > tls: test_send_two_large
  > rpc: include exception type for remote errors
  > GHA: increase timeout to 60 minutes
  > apps/httpd: replace deprecated reply::done() with write_body()
  > missing header(s)
  > net: Fix missing throw for runtime_error in create_native_net_device
  > tests/io_queue: account for token bucket refill granularity in bandwidth checks
  > Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
    tests: add regression tests for zero-length iovec handling
    iovec: fix iovec_trim_front infinite loop on zero-length iovecs
  > util/process: graduate process management API from experimental
  > cooking: don't register ready.txt as a build output
  > sstring: make make_sstring not static
  > Add SparkyLinux to debian list in install-dependencies.sh
  > http: allow control over default response headers
  > Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
    tests/perf: add chunked_fifo microbenchmarks
    chunked_fifo: set the default free chunk retention to 0
    chunked_fifo: make free chunk retention configurable
  > Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
    tests: add regression test for pollable_fd_state_completion reuse
    reactor_backend: use reset() in AIO and epoll poll paths
    reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
  > Merge 'coroutine: Generator cleanups' from Kefu Chai
    coroutine/generator: extract schedule_or_resume helper
    coroutine/generator: remove unused next_awaiter classes
    coroutine/generator: remove write-only _started field
    coroutine/generator: assert on unreachable path in buffered await_resume
    coroutine/generator: add elements_of tag and #include <ranges>
    coroutine/generator: add empty() to bounded_container concept
  > cmake: bump minimum Boost version to 1.79.0
  > seastar_test: remove unnecessary headers
  > cmake: bump minimum GnuTLS version to 3.7.4
  > Merge 'reactor: add get_all_io_queues() method' from Travis Downs
    tests: add unit test for reactor::get_all_io_queues()
    reactor: add get_all_io_queues() method
    reactor: move get_io_queue and try_get_io_queue to .cc file
  > http: deprecate reply::done(), remove _response_line dead field
  > core: Deprecate scattered_message
  > ci: add workflow dispatch to tests workflow
  > perf_tests: exit non-zero when -t pattern matches no tests
  > Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
  > perf_tests: add total runtime to json output
  > Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
    implement move assignment operator for json_list_template
    json_list_template copy assignment operator reserves capacity upfront
  > perf_tests: add --no-perf-counters option
  > Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
    memory: Add compile-time test for value-to-human-readable conversion
    memory: Extend list of suffixes to have peta-s
    memory: Fix off-by-one in suffix calculation
    memory: Mark to_human_readable_value() and others constexpr
  > http: Improve writing of response_line() into the output
  > Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
    websocket: add template parameter for text/binary frame mode
    websocket: impl client side websocket function
  > file: Fix checks for file being read-only
  > reactor: Make do_dump_task_queue a task_queue method
  > Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
    tests/output_stream: sample type patterns in sanitizer builds
    tests/output_stream: extend invariant test to cover mixed write modes
    iostream: allow unrestricted mixing of buffered and zero-copy writes
    tests/output_stream: remove obsolete ad-hoc splitting tests
    tests/output_stream: add invariant-based splitting tests
    iostream: rename output_stream::_size to ::_buffer_size
  > reactor_backend: replace virtual bool methods with const bool_class members
  > resource: Avoid copying CPU vector to break it into groups
  > perf_tests: increase overhead column precision to 3 decimal places
  > Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
    reactor: Deprecate fdatasync() method
    file: Do fdatasync() right in the posix_file_impl::flush()
    file: Propagate aio_fdatasync to posix_file_impl
    reactor: Move reactor::fdatasync() code to file.cc
    reactor,file: Make full use of file_open_options::durable bit
    file: Add file_open_options::durable boolean
    file: Account io_stats::fsyncs in posix_file_impl::flush()
    reactor: Move _fsyncs counter onto io_stats
  > http: Remove connection::write_body()

Closes scylladb/scylladb#29553
2026-05-14 10:45:39 +03:00
Botond Dénes
1403f18240 Merge 'alternator: add more vector search features' from Nadav Har'El
Recently (in commit 37fc1507f0) we added vector search support for Alternator.
That implementation was functional, but did not yet support all the features that we had envisioned.

This patch series adds some of the missing features to Alternator's vector search. Each feature is described in more detail in its own patch.

* Metrics related to vector search usage in Alternator.
* `SimilarityFunction` option when creating a vector index to choose the similarity function. Defaults to `COSINE` (the existing default). Other options are `DOT_PRODUCT` and `EUCLIDEAN`.
* An optimized vector type, `{"FLOAT32VECTOR": [1.0, 2.0, ..]}`, which is stored on disk efficiently as 32-bit floats, **not** a JSON.
* A Query VectorSearch option `ReturnScores` asking to return the similarity score calculated for each returned result (the results are sorted in decreasing similarity score - the highest similarity is the best and returned first).

Closes scylladb/scylladb#29554

* github.com:scylladb/scylladb:
  alternator: add ReturnScores option to VectorSearch
  vector_store_client: read and return similarity_scores
  alternator: add optimized vector type for vector search
  alternator: add SimilarityFunction option to vector index creation
  alternator: add vector search metrics
2026-05-14 10:41:41 +03:00
Andrei Chekun
a09fdfc46a test.py: fix issue that C++ tests' logs are deleted
Skip deletion of the log file when a C++ test fails, so its logs remain available.

Closes scylladb/scylladb#29859
2026-05-13 21:31:03 +03:00
Avi Kivity
f2ab911a46 Merge 'test/cluster: fix server-starting functions to wait for all ports' from Nadav Har'El
This series fixes a recurring source of flaky tests in the cluster test suite.

When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open — causing intermittent failures.

The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port.

This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it:

* Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING.
* Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee.
* Remove now-redundant expected_server_up_state=SERVING from existing tests.
* A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly.
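
The check_serving_notification() fix can be sketched like this (the class shape and names are illustrative, not the actual test/pylib code):

```python
class ServerWatcher:
    """Minimal sketch of the polling fix: return the updated flag directly
    instead of a stale False, which previously wasted one 100ms interval."""
    def __init__(self, notify_future):
        self._notify_future = notify_future
        self._received_serving = False

    def check_serving_notification(self) -> bool:
        if not self._received_serving and self._notify_future.done():
            # The old code updated the flag here but still returned False.
            self._received_serving = True
        return self._received_serving
```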

Closes scylladb/scylladb#29758

* github.com:scylladb/scylladb:
  test/pylib: fix missing protocol_version=4 on control_cluster
  scylla_cluster: guard poll_status() set_result() calls against cancelled future
  test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
  test/cluster: fix check_serving_notification() inefficiency
  test/cluster: remove now-redundant expected_server_up_state=SERVING
  test/cluster: document that add/start waits for all ports to be ready
  test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
  test/cluster: fix server_add/server_start hanging when starting in maintenance mode
  main: notify "entering maintenance mode" after the maintenance CQL server is ready
  test/cluster: make server_start() default to ServerUpState.SERVING
  test/cluster: make server_add() default to ServerUpState.SERVING
2026-05-13 21:23:18 +03:00
Alex
6188bf3e01 test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
2026-05-13 17:55:20 +03:00
Piotr Dulikowski
f3ac35f9d2 Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807

no backport - strong consistency is still experimental

Closes scylladb/scylladb#28843

* github.com:scylladb/scylladb:
  strong_consistency: wait for leader when starting a group
  strong_consistency: change wait for groups to start on startup
  strong_consistency: optimize wait_for_groups_to_start
  strong_consistency: wait for raft servers to start in create table
2026-05-13 16:42:05 +02:00
Piotr Dulikowski
dc05bd35bb Merge 'strong_consistency: limit available consistency levels in strong consistent requests' from Michał Jadwiszczak
Strongly consistent requests take a different path than EC requests, and consistency levels don't map well between them.
We should limit the available consistency levels in SC requests to avoid silently ignoring them, which may confuse users.

For writes, there is only one option:
- QUORUM/LOCAL_QUORUM (multi DC is not supported yet, so both of those CLs have the same effect) - we need quorum of replicas to successfully commit new mutations to Raft log.

For reads, there are 2 options:
- QUORUM/LOCAL_QUORUM - if the user wants to be sure they see the latest data; the query needs to execute `read_barrier()`, which requires a quorum of replicas
- ONE/LOCAL_ONE - if the user just wants to read data from one replica without synchronization
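
The restriction described above amounts to a simple allow-list check (the set contents come from this message; the `validate_sc_consistency` helper is illustrative, not Scylla's actual C++ code):

```python
# Allowed consistency levels for strongly consistent requests.
SC_WRITE_CLS = {"QUORUM", "LOCAL_QUORUM"}
SC_READ_CLS = {"QUORUM", "LOCAL_QUORUM", "ONE", "LOCAL_ONE"}

def validate_sc_consistency(cl: str, is_write: bool) -> None:
    """Reject consistency levels that a strongly consistent request
    would otherwise silently ignore."""
    allowed = SC_WRITE_CLS if is_write else SC_READ_CLS
    if cl not in allowed:
        kind = "writes" if is_write else "reads"
        raise ValueError(f"consistency level {cl} not supported for SC {kind}")
```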

All tests were updated to use LOCAL_QUORUM for both reads and writes.

Fixes SCYLLADB-1766

SC is in experimental phase and this patch is an improvement, no backport needed.

Closes scylladb/scylladb#29691

* github.com:scylladb/scylladb:
  strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
  strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
2026-05-13 16:31:05 +02:00
Nadav Har'El
4082fdf350 alternator: add ReturnScores option to VectorSearch
A vector search operation in Alternator (VectorSearch option to Query)
returns items sorted by decreasing similarity to the searched vector.

Although the items are sorted by decreasing similarity scores, before
this patch the user had no way to see the values of these scores.
This patch adds a new VectorSearch option, `ReturnScores`. This option
defaults to `NONE`. But if set to `SIMILARITY`, the query will return
an array `Scores` with the same length as `Items`, which gives the
similarity score for each item.

As usual, this patch includes the implementation, the documentation,
and tests for the new feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 14:19:17 +03:00
Nadav Har'El
c56361a6d7 vector_store_client: read and return similarity_scores
The vector store returns for every ANN search, in addition to the keys
of the matching items, two additional vectors - "distances" and
"similarity_scores". The "distances" are raw distance metrics - lower
scores are better matches, while "similarity_scores" are modified
such that higher scores are better matches.

Traditionally, search scores in systems like Cassandra and Open Search
use the "similarity scores" approach (higher is better, results are
returned in decreasing similarity order), so this is the more interesting
vector of the two.

But before this patch, our vector_store_client::ann() inspected
only "distances". But... then, it didn't return even that to the
caller :-)

So in this patch, we:

1. Ignore "distances" and instead look at "similarity scores",
   which is what users really want based on their experience with
   other vector and non-vector search engines.

2. Return the similarity score of each match together with the match.
   We already have this score (the vector store returns it) and we
   can add it to the existing primary_key structure of each result.
   So each result is a "struct primary_key" which has fields partition,
   clustering, and after this patch - similarity.

Existing callers in CQL and Alternator vector search will ignore this
"similarity" field in each result, and not notice it was added.
But in the next patch, we'll allow Alternator's vector search to
return this similarity in each result.

The existing unit tests for vector_store_client.cc mocked vector-store
responses with "distances", without "similarity_scores", so no longer
represent what we actually expect the vector store to do. So this patch
also contains modifications for these tests, to mock and to test
"similarity_scores" - not "distances". The more interesting tests, in
the next patch, use the real vector store and check that we really do
get a "similarity_scores" response from it.

This patch also handles a small corner case for DOT_PRODUCT, which is
the only unbounded similarity function. If the similarity overflows
the 32-bit float, the vector store returns a JSON "null" instead of
a JSON number (since JSON doesn't support infinite numbers). Our
existing vector-store client code errored out when it saw this "null",
which is wrong - the request should be allowed to proceed. So in this
patch when we see a "null" JSON for similarity, we return +Inf.
This is usually correct because the top results really have +Inf, not
-Inf, but if we ask for all items we can reach those with similarity
-Inf and incorrectly assign +Inf to them (we have a test for this case
in the next patch). But this problem won't happen when Limit is low,
and in any case it's better than aborting the request after it had
already succeeded.
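
The null-handling corner case amounts to a one-line mapping (a sketch; the function name is illustrative and the real client is C++):

```python
import math

def parse_similarity(json_value):
    """JSON can't encode infinity, so the vector store sends null when a
    DOT_PRODUCT similarity overflows a 32-bit float; map null to +Inf
    rather than failing a request that has already succeeded."""
    if json_value is None:
        return math.inf
    return float(json_value)
```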

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 14:19:17 +03:00
Tomasz Grabiec
66439bb753 Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing

The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).

The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.

Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.

The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
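
The threshold check can be sketched as follows (the formula is inferred from the scenarios above, whose percentages it reproduces; the real `is_balanced()` is C++):

```python
def is_balanced(max_load: float, min_load: float, threshold: float) -> bool:
    """Stop intranode migrations when the relative load difference between
    the most-loaded and least-loaded shard is within the threshold."""
    if max_load == 0:
        return True
    return (max_load - min_load) / max_load <= threshold
```

With a 1% threshold: 257 vs 255 tablets gives a 0.78% relative difference (balanced, no migration), while 307 vs 205 gives 33% (migration emitted).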

Fixes: SCYLLADB-1775

This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2

Closes scylladb/scylladb#29756

* github.com:scylladb/scylladb:
  test: add test for intranode balance threshold in size-based mode
  tablet_allocator: apply balance threshold to intranode shard balancing
2026-05-13 13:09:52 +02:00
Piotr Smaron
0fcae72530 test: bootstrap tombstone gc repair cluster sequentially
Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes, busy with hinted handoff and materialized view startup work, can time out while applying Raft entries before the test starts.

Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior.

Closes scylladb/scylladb#29829
2026-05-13 13:58:44 +03:00
Nadav Har'El
51c35c05e2 test/cqlpy: teach run-cassandra to use Docker
The test/cqlpy/run-cassandra script makes it quite easy to run test/cqlpy
tests against Cassandra, which is important for checking compatibility.

Unfortunately, because modern Linux distributions like Fedora do not have
either Cassandra or the old version of Java that it needs, the user needs
to download those manually. This is fairly easy, and explained in detail
in test/cqlpy/README.md, but nevertheless is a non-trivial manual step.

So this patch adds an even simpler alternative, the "--docker" option
which tells the script to run the official Cassandra docker image,
complete with the version of Java that it prefers - the user does not
need to download or install Cassandra or Java. The image is efficiently
cached by Docker, so running run-cassandra again doesn't need to
download it again; Moreover, trying several different versions of
Cassandra only needs to download and store the shared parts (base image
and Java) once.

test/cqlpy/run-cassandra --docker test_file.py::test_function

This runs the latest Cassandra 5 release by default. You can also use
"--docker=4" to get the latest Cassandra 4 release, "--docker=3.11"
to get the latest Cassandra 3.11 patch release, or "--docker=3.11.1"
to get a specific patch release.

In addition to the "--docker" option, this patch also introduces a
second option, "--java-docker", which takes *only* Java from docker,
but runs your locally installed Cassandra (to which you should point
with the CASSANDRA environment variable, as before). This option can
be useful if your host does not have a suitable version of Java, but
you want to run a locally-installed or locally-modified version of
Cassandra. The "--java-docker" option defaults to getting Java 11,
to use other versions you can use for example "--java-docker=17".

Fixes #25826.

Closes scylladb/scylladb#29860
2026-05-13 11:57:18 +02:00
Nadav Har'El
85c6cafb1d alternator: add optimized vector type for vector search
Today in Alternator vector search, vectors are presented to the API as
lists of numbers. I.e., in JSON a vector is sent in requests and responses
as:

     {"L": [{"N": "3.14159"}, {"N": "6.7"}]}

This format is verbose and inefficient for long vectors. Even worse,
because the "N" number format has precision guarantees in DynamoDB,
we cannot optimize the storage of such vectors by, for example, storing
the numbers as 32-bit floats. We actually store these vectors as JSON,
exactly as shown above.

So in this patch we introduce a new DynamoDB type, "FLOAT32VECTOR", for
vectors. The above vector will look like this in JSON:

     {"FLOAT32VECTOR": [3.14159, 6.7]}

Note that each number is an unquoted JSON number, not a JSON string.
Importantly, the definition of the "FLOAT32VECTOR" type specifies that
components of the vector only have 32-bit precision. This means that
Scylla may store internally these vectors as lists of 32-bit floats -
not as a JSON. And indeed, this patch includes this optimization:
Top-level vector attributes are now encoded in an optimized way,
as a byte 5 (alternator_type::FLOAT32VECTOR) followed by the elements
of the vector, just 4 bytes each (the 4-byte big-endian IEEE 754
representation of each floating-point component).
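
The encoding described above can be sketched in Python (the helper names are illustrative; the actual implementation is Scylla C++):

```python
import struct

def encode_float32vector(components):
    """Type byte 5 (alternator_type::FLOAT32VECTOR) followed by each
    component as a 4-byte big-endian IEEE 754 float."""
    return bytes([5]) + b"".join(struct.pack(">f", x) for x in components)

def decode_float32vector(data):
    """Inverse of the above: check the type byte, then unpack floats."""
    assert data[0] == 5, "not a FLOAT32VECTOR value"
    n = (len(data) - 1) // 4
    return list(struct.unpack(f">{n}f", data[1:]))
```

So `[3.14159, 6.7]` occupies 9 bytes (1 type byte + 2 × 4 bytes), versus dozens of bytes for the JSON form.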

This patch also includes documentation, and extensive tests that the
new "FLOAT32VECTOR" type works (which also serves as an example how to
use it in the boto3 SDK), that it is indeed encoded internally as 32-bit
floats and not wasteful JSON strings, and that vector search on such items
works. The last thing requires cooperation from the vector store, of
course - it needs to be able to understand the new optimized encoding
of vector attributes in addition to the old unoptimized one.

Note that the old unoptimized ("list of numbers") vectors are still
supported. Although not recommended for general use, some users might
still want to use the unoptimized type if they have pre-existing data
created on DynamoDB or Alternator without vector search in mind, and
the vectors already exist as lists of numbers.

Although this is less important, the new vector type "FLOAT32VECTOR"
is also allowed in a Query's QueryVector.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 11:57:45 +03:00
Nadav Har'El
ea910acdd4 alternator: add SimilarityFunction option to vector index creation
Before this patch, vector search always used the COSINE similarity
function. In this patch we add the ability to choose a different
similarity function when creating a new vector index (with CreateTable
or UpdateTable) by using the SimilarityFunction option. We still default
to "COSINE" if SimilarityFunction isn't specified.

Allowed similarity functions are COSINE, DOT_PRODUCT, and EUCLIDEAN.
DescribeTable can also retrieve a vector index's SimilarityFunction.

As usual, this patch also includes documentation for the new feature,
and tests. Some of the tests can run without a vector store - verifying
the API syntax and which similarity function is supported - but we
also add tests that require the vector store and check that the different
similarity functions actually sort the nearest items in the expected
order.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 11:57:45 +03:00
Nadav Har'El
70283967d3 alternator: add vector search metrics
Before this patch, we did not have any special metrics for vector search
in Alternator. We have had count of "Query" operations, but there was no
distinction between "standard" queries - of a base table or GSI/LSI -
and vector-search queries.

This patch adds four new metrics:

 * vector_search_query - counting how many Query requests are actually
   vector searches.

 * vector_search_query_returned_items - counting how many items were
   returned by vector searches.

 * vector_search_query_items_from_vs - counting how many results were
   retrieved from the vector-store backend.

 * vector_search_query_items_from_base_table - counting how many items
   were read from the base table during vector-search queries. Some
   vector search queries using SELECT=ALL_PROJECTED_ATTRIBUTES or COUNT
   are optimized to not need to read items from the base table.

This patch also includes documentation for the new four metrics, and
tests that they count what we want them to count.
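The shape of such a metric assertion can be sketched in Python: scrape the metrics endpoint before and after a vector-search Query and assert on the counter deltas. The parsing helper below and the `scylla_alternator_` prefix are illustrative assumptions, not the actual test code.

```python
def parse_metrics(text):
    """Sum 'name{labels} value' samples per metric name from a
    Prometheus-style text exposition, collapsing labels/shards."""
    totals = {}
    for line in text.splitlines():
        if line.startswith('#') or not line.strip():
            continue
        name_part, _, value = line.rpartition(' ')
        name = name_part.split('{')[0]  # strip the label set, if any
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals

def metric_delta(before, after, name):
    return after.get(name, 0.0) - before.get(name, 0.0)
```

A test would then scrape once, run the Query, scrape again, and assert that `metric_delta(before, after, "...vector_search_query")` equals the number of vector searches performed.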

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-13 11:57:44 +03:00
Patryk Jędrzejczak
3f2ff5a13f Merge 'Remove raft_group0::finish_setup_after_join' from Gleb Natapov
The function does nothing useful now.

No backport needed. Removes code.

Closes scylladb/scylladb#29828

* https://github.com/scylladb/scylladb:
  raft_group0: remove finish_setup_after_join function
  raft_group0: fix indentation after the last change
  raft_group: drop unneeded checks
2026-05-13 10:53:37 +02:00
Michael Litvak
5a5c7c6241 strong_consistency: wait for raft servers to start in create table
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.
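The coordinator-side flow described above can be sketched in Python (the real code is C++/Seastar); `hosts_with_replica` and `send_barrier_rpc` are hypothetical stand-ins for the RPC plumbing.

```python
import asyncio

async def wait_for_table_raft_groups(hosts_with_replica, send_barrier_rpc):
    # One RPC per host holding a tablet replica: each host executes a
    # group0 barrier (so the table and its raft groups exist) and then
    # waits for the table's local raft groups to finish starting.
    await asyncio.gather(*(send_barrier_rpc(host) for host in hosts_with_replica))
```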

Fixes SCYLLADB-807
2026-05-13 08:43:24 +02:00
Yaniv Michael Kaul
5d6f160129 test: update get_scylla_2025_1_executable() to use 2025.1.12
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.

The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29714
2026-05-12 23:20:55 +02:00
Michał Jadwiszczak
68f0cf6fac strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
To successfully write data to a strongly consistent table, a quorum of
replicas needs to be used to save the data to the Raft log.

So the only reasonable consistency levels are QUORUM/LOCAL_QUORUM
(currently SC doesn't support multi DC).
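The validation this implies can be sketched as a simple check at statement-execution time; the exception type and string values below are illustrative, not Scylla's actual API.

```python
# Consistency levels accepted for writes to strongly consistent tables.
ALLOWED_SC_WRITE_CLS = {"QUORUM", "LOCAL_QUORUM"}

def check_sc_write_cl(cl):
    """Reject any write consistency level other than QUORUM/LOCAL_QUORUM."""
    if cl not in ALLOWED_SC_WRITE_CLS:
        raise ValueError(
            f"Consistency level {cl} is not supported for writes to "
            "strongly consistent tables; use QUORUM or LOCAL_QUORUM")
```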
2026-05-12 23:20:03 +02:00
Wojciech Mitros
f3cf20803b test: run test_mv_admission_control_exception on one shard
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.

On each node/shard we have 2 types of view update backlogs:
 1. for deciding whether we should admit writes
 2. for propagating the backlog information to other nodes/shards.

For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.

The view update backlog of type 2. is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1., it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.

We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.

That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can predict the shard by calculating the token
of the affected partition, but for tablets it had no opportunity to
learn the tablet->shard mapping, so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver has established
connections to all shards on all nodes at the point of either write. So
if a connection finishes establishing between the two writes, the two
writes may end up coordinated on different shards, so the second write
misses the view backlog growth and is not rejected.

We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.

Fixes: SCYLLADB-1901

Closes scylladb/scylladb#29862
2026-05-12 17:34:19 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is a first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for a single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything again)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manifest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formatting of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Yaniv Michael Kaul
c359a09189 test: add UDF/UDA keyspace isolation and UDT tests
Port 3 tests from scylla-dtest user_functions_test.py:
- test_udf_with_udt: UDF taking frozen UDT arg, verifies DROP TYPE blocked
- test_udf_with_udt_keyspace_isolation: cross-keyspace UDT references rejected
- test_aggregate_with_udt_keyspace_isolation: cross-keyspace UDT in UDA rejected

All tests use Lua (Scylla's supported UDF language).
Reproduces CASSANDRA-9409.

Closes scylladb/scylladb#1928

Closes scylladb/scylladb#29843
2026-05-12 14:57:14 +03:00
Yaniv Michael Kaul
f55a55fbf3 docker: fix coredump collection when host uses pipe-based core_pattern
The container image inherits kernel.core_pattern from the host.  When
the host pipes core dumps to a handler (e.g. Ubuntu's apport), that
handler does not exist or work correctly inside the container, so core
dumps are silently lost.

Override any pipe-based core_pattern with a file-based pattern that
writes directly to /var/lib/scylla/coredump/.  The override is attempted
both from the entrypoint (scyllasetup.coredumpSetup) and from
scylla-server.sh when running as root; it succeeds only when the
container has write access to /proc/sys/kernel/core_pattern and is
silently skipped otherwise.

Fixes: SCYLLADB-1366

Closes scylladb/scylladb#29337
2026-05-12 14:16:22 +03:00
Piotr Smaron
1018710e38 test/cqlpy: un-xfail oversized indexed value build test
Issue #8627 is fixed, so test_too_large_indexed_value_build now passes and should run normally instead of XPASSing under strict xfail.

Fixes: SCYLLADB-1938

Closes scylladb/scylladb#29853
2026-05-12 11:40:53 +02:00
Avi Kivity
ddb1181103 Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.
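The adjusted condition can be sketched as follows (a Python sketch with hypothetical field names, not the actual `tablet_allocator.cc` code): missing stats abort balancing unless the node is drained and excluded, and after the fix this exception applies whether or not capacity-based balancing is forced.

```python
def abort_for_incomplete_stats(node):
    """Return True if this node's missing tablet stats should abort
    balancing. Drained + excluded nodes are unreachable and will never
    report stats, so their missing data is ignored - now regardless of
    the balancing mode in effect."""
    return node["stats_missing"] and not (node["drained"] and node["excluded"])
```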

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1803

Closes scylladb/scylladb#29791

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-12 12:38:25 +03:00
Andrzej Jackowski
89261bf759 test: wait for TTL scheduling sanity metric
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.

Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
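The wait-then-retry shape described above, as a generic Python sketch (timings and callables are illustrative, not the actual test helpers):

```python
import time

def wait_for_metric_advance(read_metric, baseline, rewrite,
                            timeout=30.0, poll=0.5):
    """Poll until the metric advances past `baseline`; if it never does
    within `timeout`, re-issue the same writes once and check again.
    The retried writes update the same keys, so row count is unchanged."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_metric() > baseline:
            return True
        time.sleep(poll)
    rewrite()
    return read_metric() > baseline
```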

In a local 100-run with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that
in a rare corner case the delta drops even to 0.

Fixes SCYLLADB-1869

Closes scylladb/scylladb#29797
2026-05-12 12:38:25 +03:00
Botond Dénes
8d6f031a4a schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.

Fixes SCYLLADB-1353

Closes scylladb/scylladb#29804
2026-05-12 12:38:25 +03:00
Piotr Dulikowski
7c2b1ea0b5 Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak
`system.view_building_tasks` is a single-partition Raft group0 table (PK = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.

Two-part fix:

**1. Range tombstones instead of row tombstones (commits 2–3)**

Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.

**2. Bounded scan with `min_task_id` (commits 4–6)**

Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.

   - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
   - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
   - On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
   - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.

The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
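The GC side of the scheme can be sketched over a sorted list of task ids (timeuuids sort by creation time): find the smallest alive id, emit one range tombstone covering everything below it, and record that bound as `min_task_id`. A simplified Python sketch with illustrative types, not the actual mutation-builder code.

```python
def gc_finished_tasks(tasks):
    """tasks: list of (task_id, finished) pairs sorted by task_id.
    Returns the mutation plan for one GC round, or None when no task
    is alive (a different path would apply then)."""
    alive = [tid for tid, finished in tasks if not finished]
    if not alive:
        return None
    min_alive = alive[0]
    # One range tombstone [before_all, min_alive) plus the static bound,
    # written atomically in the same raft command.
    return {"range_tombstone_end": min_alive, "min_task_id": min_alive}
```

On reload, the scan would then use `id >= min_task_id` as its lower bound, skipping all rows covered by earlier tombstones.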

The issue is not critical, so the fix shouldn't be backported.

Fixes SCYLLADB-657

Closes scylladb/scylladb#28929

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
  docs: document tombstone avoidance in view_building_tasks
  view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
  view_building: introduce `task_uuid_generator`
  view_building: store `min_alive_uuid` in view building state
  view_building: set min_task_id when GC-ing finished tasks
  view_building: add min_task_id support to view_building_task_mutation_builder
  view_building: add min_task_id static column and bounded scan to system_keyspace
  view_building: use range tombstone when GC-ing finished tasks
  view_building: add range tombstone support to view_building_task_mutation_builder
  view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
1c0f8ab66e Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.

This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:

- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)

The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.

The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
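The single-chokepoint helper can be sketched in Python (the real helpers are C++ in `sstables/sstables.cc`; names here are illustrative stand-ins): a live-updatable flag decides whether the helper throws or aborts, leaving a coredump.

```python
import os

_abort_on_malformed = False  # defaults to off, preserving current behaviour

def set_abort_on_malformed(value):
    """LiveUpdate-style setter for the flag."""
    global _abort_on_malformed
    _abort_on_malformed = value

def throw_malformed_sstable_exception(msg):
    if _abort_on_malformed:
        os.abort()  # coredump preserves the stack frames at the error site
    raise RuntimeError("malformed sstable: " + msg)
```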

**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated

Refs: SCYLLADB-1087
Backport: new feature, no backport

Closes scylladb/scylladb#29324

* github.com:scylladb/scylladb:
  sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
  sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
  sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
  sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
  sstables: introduce --abort-on-malformed-sstable-error infrastructure
  sstables: refactor parse_path() to return std::expected<> instead of throwing
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
150345cc52 Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.

New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.

A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.

A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.

Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
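The bucket-name derivation can be sketched like this; the sanitization rules (lowercase letters, digits, dashes, 63-char cap) follow S3's bucket-naming constraints, and the helper body is an illustrative sketch of `create_test_bucket()`, not its actual code.

```python
import re
import uuid

def create_test_bucket_name(pytest_node_name):
    """Derive a unique, S3-legal bucket name from a pytest node name."""
    base = re.sub(r"[^a-z0-9-]", "-", pytest_node_name.lower()).strip("-")
    suffix = uuid.uuid4().hex[:8]  # short UUID suffix for uniqueness
    return f"{base[:63 - len(suffix) - 1]}-{suffix}"
```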

Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.

| Test Name                                                    | new test specific retry strategy execution time (ms) | original execution time (ms) |   Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy      |          19,261 |       61,395 | −42,134  | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy   |          16,901 |       53,688 | −36,787  | **3.2×** |
| test_client_upload_file_single_part_proxy                    |           3,478 |        6,789 |  −3,311  | **2.0×** |
| test_client_multipart_copy_upload_proxy                      |           1,303 |        1,619 |    −316  | 1.2×    |
| test_client_put_get_object_proxy                             |             150 |          365 |    −215  | **2.4×** |
| test_client_readable_file_stream_proxy                       |             125 |          327 |    −202  | **2.6×** |
| test_small_object_copy_proxy                                 |             205 |          389 |    −184  | 1.9×    |
| test_client_put_get_tagging_proxy                            |             181 |          350 |    −169  | 1.9×    |
| test_client_multipart_upload_proxy                           |           1,252 |        1,416 |    −164  | 1.1×    |
| test_client_list_objects_proxy                               |             729 |          881 |    −152  | 1.2×    |
| test_chunked_download_data_source_with_delays_proxy          |             830 |          960 |    −130  | 1.2×    |
| test_client_readable_file_proxy                              |             148 |          279 |    −131  | 1.9×    |
| test_client_upload_file_multi_part_with_remainder_minio      |           3,358 |        3,170 |    +188  | 0.9×    |
| test_client_upload_file_multi_part_without_remainder_minio   |           3,131 |        2,929 |    +202  | 0.9×    |
| test_client_upload_file_single_part_minio                    |             519 |          421 |     +98  | 0.8×    |
| test_download_data_source_proxy                              |             180 |          237 |     −57  | 1.3×    |
| test_client_list_objects_incomplete_proxy                     |             590 |          641 |     −51  | 1.1×    |
| test_large_object_copy_proxy                                 |             952 |          991 |     −39  | 1.0×    |
| test_client_multipart_upload_fallback_proxy                  |             148 |          185 |     −37  | 1.3×    |
| test_client_multipart_copy_upload_minio                      |             641 |          674 |     −33  | 1.1×    |

No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.

Closes scylladb/scylladb#29508

* github.com:scylladb/scylladb:
  test: extract object storage helpers to test/pylib/object_storage.py
  test: add per-test bucket isolation to object_store fixtures
  s3: add client::make overload with custom retry strategy
  test: add s3_test_fixture and migrate tests to per-bucket isolation
  s3: add create_bucket and delete_bucket to client
2026-05-12 12:38:24 +03:00
Ferenc Szili
6856f51097 test: add test for intranode balance threshold in size-based mode
Verify that the load balancer does not issue intranode migrations when
the load difference between shards is within the size_based_balance_threshold,
and that it does issue migrations when the difference exceeds the threshold.
2026-05-12 10:34:25 +02:00
Pavel Emelyanov
19820910f8 test: Add test for backup vs migration race
The test starts a regular backup+restore on a smaller cluster, but prior
to it spawns a tablet migration from one node to another and locks it in
the middle with the help of the block_tablet_streaming injection (even
though the tablets have no data and there's nothing to stream, the
injection is located early enough to work).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
3bcefa42c5 test: Restore resilience test
The test checks that losing one of the nodes from the cluster while
restore is in progress is handled. In particular:

- losing the API node makes the task-waiting API throw (apparently)
- losing a coordinator or replica node makes the API call fail, because
  some tablets fail to get restored. If the coordinator is lost, it
  triggers coordinator re-election and the new coordinator still notices
  that a tablet that was replicated to the "old" coordinator failed to
  get restored, and fails the restore anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
69b8f76a32 sstables_loader: Fail tablet-restore task if not all sstables were downloaded
When storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success, we need to check whether all sstables were downloaded
and, if not, resolve the task with an exception.

Test included. It uses fault injection to abort the download of a single
sstable early, then checks that the error is properly propagated back
to the task-waiting API.
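The final success check can be sketched in Python (a simplified stand-in; the `downloaded` flag mirrors the column this series adds to snapshot_sstables):

```python
def resolve_restore_task(sstable_rows):
    """sstable_rows: rows read back from snapshot_sstables, each a dict
    with an 'id' and a 'downloaded' flag. Raise if any download failed."""
    missing = [r["id"] for r in sstable_rows if not r["downloaded"]]
    if missing:
        raise RuntimeError(
            "restore failed: %d sstable(s) were not downloaded" % len(missing))
    return "done"
```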

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
4137211cf4 test: Add a test for tablet-aware restore
The test is derived from the test_restore_with_streaming_scopes() one,
with the exception that it doesn't check streaming directions, doesn't
check mutations right after creation, and doesn't loop over scoped
sub-tests, because there's no scope concept here.

Also it verifies just two topologies, which seems to be enough. The
scopes test has many topologies because of the nature of the scoped
restore; with cluster-wide restore such flexibility is not required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
dcd490666b test: Add rest_client helper to kick newly introduced API endpoint
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
d280987f2c sstables_loader: Add keyspace and table arguments to manfiest loading helper
When restoring a backup into a keyspace under a different name than the
one it had during backup, the snapshot_sstables table must be populated
with the _new_ keyspace name, not the one taken from the manifest. The
same is true for the table name.

This patch makes it possible to override the keyspace/table loaded from
the manifest file with the provided values. In the future it would also
be good to check that, if those values are not provided by the user, the
values read from different manifest files are the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Robert Bindar
c97232bb7b sstables_loader: add function for parsing backup manifests
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:22 +03:00
Robert Bindar
f0e8d6c9dd split utility functions for creating test data from database_test
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
b52e40e512 export make_storage_options_config from lib/test_services
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
2f19d84ad7 Add system_distributed_keyspace.snapshot_sstables
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
    snapshot_name text,
    keyspace text, table text,
    datacenter text, rack text,
    id uuid,
    first_token bigint, last_token bigint,
    toc_name text, prefix text,
    PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id));
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker
nodes, which will use it to filter sstables and download them via the
file-based download path.
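Worker-side filtering against this table can be sketched like so (a simplified Python sketch; filtering on the node's DC/rack and token-range overlap is an assumption drawn from the partition key above, not the actual sstables_loader code):

```python
def sstables_for_tablet(rows, dc, rack, tablet_first, tablet_last):
    """Keep rows belonging to this node's DC/rack whose
    [first_token, last_token] range overlaps the tablet's token range."""
    return [r for r in rows
            if r["datacenter"] == dc and r["rack"] == rack
            and r["first_token"] <= tablet_last
            and r["last_token"] >= tablet_first]
```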

Fixes SCYLLADB-263

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
31e9f04714 add get_system_distributed_keyspace to cql_test_env
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:17:40 +03:00
Botond Dénes
866afe4c1e Merge ' db: add large data metrics for rows, cells, and collections' from Taras Veretilnyk
- Add `large_rows_exceeding_threshold`, `large_cell_exceeding_threshold`, and `large_collection_exceeding_threshold` metrics to complement the existing `large_partition_exceeding_threshold`
- Add unit tests verifying stats counters increment correctly during SSTable writes

Backport is not needed

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1095

Closes scylladb/scylladb#29722

* github.com:scylladb/scylladb:
  test/boost: add tests for large data stats counters
  db: add large data metrics for rows, cells, and collections
2026-05-12 10:04:53 +03:00
Taras Veretilnyk
47b4fa920d test/boost: add tests for large data stats counters
Add test_large_data_stats_large_rows, test_large_data_stats_large_cells,
and test_large_data_stats_large_collections to verify that the
large_data_handler stats counters are correctly incremented during
SSTable writes and that unrelated counters remain at zero.
2026-05-11 23:42:14 +02:00
Calle Wilund
2cc1a2c406 storage_service: Disable snapshots after raft decommission
Fixes: SCYLLADB-1693

If we abort a decommission operation, the snapshot/backup
mechanism needs to remain open.

This change moves it to after raft_decommission.

In the case of a cluster snapshot, whether or not our node owns
the tables is serialized by raft anyway, so ownership should
remain consistent. At worst, we coordinate from a node in
"leave" status.

In the case of a local snapshot, ownership matters less; only the
sstables on disk matter, and those should not change.

In the case of backup, this operates on a snapshot, whose state
is not affected.

Adds an injection point for testing.

v2:
- Added injection point to ensure test can abort decommission

Closes scylladb/scylladb#29667
2026-05-11 17:04:09 +03:00
Yaniv Michael Kaul
3674deea54 scylla-gdb: display ms-format sstable summary from partitions db footer
For ms-format (trie-based) sstables, the traditional summary structure
is not populated. Instead, read equivalent metadata from the
_partitions_db_footer field: first_key, last_key, partition_count,
and trie_root_position.

This is a follow-up to the crash fix for SCYLLADB-1180, replacing the
informational-only message with actual useful output.

Refs: SCYLLADB-1180

Closes scylladb/scylladb#29164
2026-05-11 16:58:22 +03:00
Piotr Smaron
71542206bc cql: return InvalidRequest for oversized partition/clustering keys
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".

Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).

The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.
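The chokepoint check described above can be sketched in a few lines of Python. This is a schematic of the idea only; the constant, class, and function names are assumptions, not the C++ code in `keys/compound.hh`:

```python
MAX_KEY_LENGTH = 65535  # key components carry a 16-bit length prefix

class InvalidRequest(Exception):
    """Stand-in for the CQL InvalidRequest error; name is illustrative."""

def serialize_key_component(value: bytes) -> bytes:
    # Single chokepoint: every path that materializes a key (mutations
    # from INSERT/UPDATE/DELETE/BATCH, ranges from SELECT) goes through
    # the serializer, so one check here covers all key shapes.
    if len(value) > MAX_KEY_LENGTH:
        raise InvalidRequest(
            f"Key length of {len(value)} is longer than maximum of {MAX_KEY_LENGTH}")
    return len(value).to_bytes(2, "big") + value
```

Because the check lives in the serializer rather than in per-statement restriction code, there is no way for a new statement type to bypass it.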

Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
  test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
  test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
  testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.

test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.

Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.

Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").

fixes #10366
fixes #12247

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#23433
2026-05-11 16:56:35 +03:00
Piotr Smaron
959f67b345 cql: verify tuples length in multi-column IN restriction
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.

Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.
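The arity validation amounts to a simple per-tuple check, sketched here in Python with hypothetical names (the actual check lives in Scylla's C++ restriction-building code):

```python
def check_in_tuple_arity(columns, tuples):
    # Each tuple in a multi-column IN must have exactly as many
    # elements as there are restricted columns; reject both the
    # under-sized and over-sized cases with a uniform message.
    n = len(columns)
    for t in tuples:
        if len(t) != n:
            raise ValueError(
                f"Expected {n} elements in value tuple, but got {len(t)}")
```

For `(b, c, d) IN ((1, 2), (2, 1, 4))`, the first tuple fails with "Expected 3 elements in value tuple, but got 2".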

Fixes #13241

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#18407
2026-05-11 16:55:09 +03:00