Compare commits

...

379 Commits

Author SHA1 Message Date
Pavel Emelyanov
233da83dd9 Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close

(cherry picked from commit f27dc12b7c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29125
2026-03-25 10:04:39 +01:00
Robert Bindar
9b81939a93 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978

(cherry picked from commit 29619e48d7)

Closes scylladb/scylladb#29101
2026-03-24 16:37:28 +01:00
Michał Chojnowski
804842e95c test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152

(cherry picked from commit f29525f3a6)

Closes scylladb/scylladb#29172
2026-03-24 16:02:48 +02:00
Botond Dénes
4f77cb621f Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()

(cherry picked from commit 5573c3b18e)

Closes scylladb/scylladb#29144
2026-03-21 01:37:30 +01:00
Raphael S. Carvalho
eb6c333e1b streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).

We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-790.
Closes scylladb/scylladb#28505

(cherry picked from commit 5b550e94a6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28769
2026-03-20 15:50:46 +02:00
Calle Wilund
8d21636a81 test_internode_compression: Add await for "run" coro:s
Fixes: SCYLLADB-907

Closes scylladb/scylladb#28885

(cherry picked from commit 35aab75256)

Closes scylladb/scylladb#28905
2026-03-20 10:32:31 +02:00
Yaron Kaikov
7f236baf61 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935

(cherry picked from commit 977bdd6260)

Closes scylladb/scylladb#28946
2026-03-20 10:32:06 +02:00
Calle Wilund
4da8641d83 test_encryption: Fix test_system_auth_encryption
Fixes: SCYLLADB-915

Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).

Closes scylladb/scylladb#28887

(cherry picked from commit ef795eda5b)

Closes scylladb/scylladb#28966
2026-03-20 10:30:48 +02:00
Łukasz Paszkowski
3ab789e1ca test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626

(cherry picked from commit 826fd5d6c3)

Closes scylladb/scylladb#28967
2026-03-20 10:30:23 +02:00
Asias He
25a17282bd test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness.  This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.

Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.

Fixes SCYLLADB-865

Closes scylladb/scylladb#28945

(cherry picked from commit e0483f6001)

Closes scylladb/scylladb#28969
2026-03-20 10:30:00 +02:00
Botond Dénes
7afcc56128 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855

(cherry picked from commit 13ff9c4394)

Closes scylladb/scylladb#28975
2026-03-20 10:29:23 +02:00
Andrzej Jackowski
32443ed6f7 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Fixes: SCYLLADB-1014
Refs: SCYLLADB-163

Closes scylladb/scylladb#28982

(cherry picked from commit 9247dff8c2)

Closes scylladb/scylladb#28988
2026-03-20 10:28:45 +02:00
Botond Dénes
3e9b984020 Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait

(cherry picked from commit 3fed6f9eff)

Closes scylladb/scylladb#28991
2026-03-20 10:28:03 +02:00
Dmitriy Kruglov
2d199fb609 docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458

(cherry picked from commit cee44716db)

Closes scylladb/scylladb#28995
2026-03-20 10:27:29 +02:00
Piotr Dulikowski
35cd7f9239 Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race

(cherry picked from commit d9a277453e)

Closes scylladb/scylladb#29001
2026-03-20 10:27:04 +02:00
Pavel Emelyanov
32ce43d4b1 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050

(cherry picked from commit 8b1ca6dcd6)

Closes scylladb/scylladb#29107
2026-03-20 10:25:32 +02:00
Botond Dénes
fef7750eb6 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children

(cherry picked from commit 2e47fd9f56)

Closes scylladb/scylladb#29126
2026-03-20 10:24:30 +02:00
Botond Dénes
213442227d Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions

(cherry picked from commit e8b37d1a89)

Closes scylladb/scylladb#29135
2026-03-20 10:23:55 +02:00
Patryk Jędrzejczak
1398a55d16 test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.

Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.

This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103

Backport to all supported branches (other than 2026.1), as the test can fail
there.

Closes scylladb/scylladb#29108
2026-03-20 10:22:40 +02:00
Anna Stuchlik
a0a2a67634 doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103

(cherry picked from commit 6b1df5202c)

Closes scylladb/scylladb#29137
2026-03-20 10:22:08 +02:00
Avi Kivity
d4e454b5bc Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amoritze partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config

(cherry picked from commit 5e7fb08bf3)

Closes scylladb/scylladb#29136
2026-03-20 01:21:49 +01:00
Anna Stuchlik
825a36c97a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111

(cherry picked from commit 88b98fac3a)

Closes scylladb/scylladb#29119
2026-03-19 16:39:44 +02:00
Tomasz Grabiec
45413e99a5 Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync

(cherry picked from commit 518470e89e)

Closes scylladb/scylladb#29034
2026-03-19 14:03:40 +01:00
Michał Hudobski
c93a935564 docs: update vector search filtering to reflect primary key support only
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.

Closes scylladb/scylladb#28949

(cherry picked from commit 40d180a7ef)

Closes scylladb/scylladb#29069
2026-03-19 11:19:53 +01:00
Botond Dénes
69f78ce74a Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload

(cherry picked from commit 172c786079)

Closes scylladb/scylladb#29098
2026-03-18 13:13:01 +01:00
Patryk Jędrzejczak
3513ce6069 test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998

(cherry picked from commit 526e5986fe)

Closes scylladb/scylladb#29068
2026-03-17 18:04:27 +01:00
Piotr Dulikowski
0ca7253315 Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.

Fixes: VECTOR-528

Backports to 2025.04 and 2026.01 are required, as these branches are also affected.

Closes scylladb/scylladb#28637

* github.com:scylladb/scylladb:
  vector_search: fix TLS server name with IP
  vector_search: add warn log for failed ann requests

(cherry picked from commit 23ed0d4df8)

Closes scylladb/scylladb#28964
2026-03-17 15:30:07 +01:00
Piotr Dulikowski
c7ac3b5394 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928

(cherry picked from commit 42d70baad3)

Closes scylladb/scylladb#28968
2026-03-17 13:35:22 +01:00
Piotr Dulikowski
d6ed05efc1 Merge '[Backport 2026.1] mv: allow skipping view updates when a collection is unmodified' from Scylladb[bot]
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: SCYLLADB-996

- (cherry picked from commit 01ddc17ab9)

Parent PR: #28839

Closes scylladb/scylladb#28977

* github.com:scylladb/scylladb:
  Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
  mv: remove dead code in view_updates::can_skip_view_updates
2026-03-17 13:34:28 +01:00
Jenkins Promoter
39fcc83e75 Update ScyllaDB version to: 2026.1.1 2026-03-17 13:21:54 +02:00
Jenkins Promoter
6250f1e967 Update pgo profiles - aarch64 2026-03-15 05:07:46 +02:00
Aleksandra Martyniuk
b307c9301d nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739

(cherry picked from commit 2e68f48068)

Closes scylladb/scylladb#29006
2026-03-14 22:28:00 +02:00
Botond Dénes
f26af8cd30 mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010

(cherry picked from commit 15cfa5beeb)

Closes scylladb/scylladb#29024
2026-03-12 23:34:58 +02:00
Marcin Maliszkiewicz
2bd10bff5e Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819

The flaky test is present on master and in previous release, so backporting only there.

Closes scylladb/scylladb#28849

* github.com:scylladb/scylladb:
  test_proxy_protocol: introduce extra logging to aid debugging
  test_proxy_protocol: fix flaky system.clients visibility checks

(cherry picked from commit 4150c62f29)

Closes scylladb/scylladb#28951
2026-03-10 22:48:14 +02:00
Avi Kivity
1105d83893 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset

(cherry picked from commit 01ddc17ab9)
2026-03-10 21:27:23 +01:00
Wojciech Mitros
9b9d5cee8a mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.

It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

(cherry picked from commit ca1c8ff209)
2026-03-10 21:24:53 +01:00
Tomasz Grabiec
a8fd9936a3 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param

(cherry picked from commit b90fe19a42)

Closes scylladb/scylladb#28916
2026-03-10 11:58:03 +01:00
Botond Dénes
9190d42863 Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop

(cherry picked from commit 509f2af8db)

Closes scylladb/scylladb#28934
2026-03-09 10:25:47 +02:00
Jenkins Promoter
09b0e1ba8b Update ScyllaDB version to: 2026.1.0 2026-03-08 11:47:02 +02:00
Avi Kivity
8a7f5f1428 Merge '[Backport 2026.1] raft ropology: prevent crashes of multiple nodes' from Scylladb[bot]
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

- (cherry picked from commit e21ecf69de)

- (cherry picked from commit 8e9c7397c5)

Parent PR: #28558

Manually cherry-picked 2a3476094e.

Closes scylladb/scylladb#28735

* github.com:scylladb/scylladb:
  storage_service: raft_topology_cmd_handler: fix use-after-free
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-03-05 20:49:03 +02:00
Szymon Malewski
185288f16e test: vector_similarity: Fix similarity value checks
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877

Closes scylladb/scylladb#28748

(cherry picked from commit 4c4673e8f9)

Closes scylladb/scylladb#28907
2026-03-05 20:48:36 +02:00
Anna Stuchlik
7dfcb53197 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152

(cherry picked from commit 855c503c63)

Closes scylladb/scylladb#28910
2026-03-05 18:07:00 +02:00
Marcin Maliszkiewicz
d552244812 Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)

Backport to 2026.1, as the test is also there.

[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28877

* github.com:scylladb/scylladb:
  test: increase timeouts in test_client_routes.py
  test: add missing awaits in test_client_routes_upgrade

(cherry picked from commit 9697b6013f)

Closes scylladb/scylladb#28896
2026-03-05 17:46:39 +02:00
Patryk Jędrzejczak
62b344cb55 storage_service: raft_topology_cmd_handler: fix use-after-free
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```

This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.

No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.

Closes scylladb/scylladb#28772

(cherry picked from commit 2a3476094e)
2026-03-05 13:41:44 +01:00
Botond Dénes
173bfd627c Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795

backport: 2026.1, already on other stable branches

Closes scylladb/scylladb#28848

* github.com:scylladb/scylladb:
  test: add more logs to test_startup_no_auth_response
  test: decrease strain in test_startup_response

(cherry picked from commit f156bcddab)

Closes scylladb/scylladb#28878
2026-03-05 09:54:51 +01:00
Piotr Dulikowski
bf369326d6 Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

Closes scylladb/scylladb#28846

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness

(cherry picked from commit 2fb981413a)

Closes scylladb/scylladb#28879
2026-03-04 18:09:04 +01:00
Anna Stuchlik
864774fb00 doc: fix the upgrage guide to 2026.1
The upgrade guide on branch-2026.1 has bugs caused by incorrectly
resolved conflicts in the backport PR: https://github.com/scylladb/scylladb/pull/28835#issuecomment-3992474167

This commit fixes the issue.
The fix only applies to branch-2026.1.

Fixes https://github.com/scylladb/scylladb/issues/28871

Closes scylladb/scylladb#28872
2026-03-04 16:53:49 +02:00
Botond Dénes
49ed97cec8 Merge '[Backport 2026.1] Fix regression in Alternator TTL with tablets and node going down' from Scylladb[bot]
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.

Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.

It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.

Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.

So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):

1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).

Fixes SCYLLADB-777.

- (cherry picked from commit 9ab3d5b946)

- (cherry picked from commit 0c7f499750)

- (cherry picked from commit e463d528fe)

Parent PR: #28771

Closes scylladb/scylladb#28803

* github.com:scylladb/scylladb:
  test: add unit test for tablet_map::get_secondary_replica()
  test, alternator: add test for TTL expiration with a node down
  locator: fix get_secondary_replica() to match get_primary_replica()
2026-03-04 14:21:44 +02:00
Marcin Maliszkiewicz
81685b0d06 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature

(cherry picked from commit a83ee6cf66)

Closes scylladb/scylladb#28853
2026-03-04 08:28:39 +02:00
Anna Stuchlik
06013b2377 doc: add the upgrade guide from 2025.x to 2026.1
This commit adds the upgrade guide for version 2026.1.

According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.

In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.

Fixes https://github.com/scylladb/scylladb/issues/28533

Fixes  https://github.com/scylladb/scylladb/issues/28532

Closes scylladb/scylladb#28789

(cherry picked from commit dfd46ad3fb)

Closes scylladb/scylladb#28835
2026-03-03 13:04:16 +02:00
Grzegorz Burzyński
4cc5c2605f packaging: add systemctl command to dependencies
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.

Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.

Fixes: scylladb/scylla-operator#3080

Closes scylladb/scylladb#28567

(cherry picked from commit b4f0eb666f)

Closes scylladb/scylladb#28845
2026-03-03 13:03:46 +02:00
Anna Stuchlik
021851c5c5 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254

(cherry picked from commit 64b1798513)

Closes scylladb/scylladb#28818
2026-03-03 10:39:46 +01:00
Patryk Jędrzejczak
c4aa14c1a7 test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```

Fixes SCYLLADB-805

Closes scylladb/scylladb#28829

(cherry picked from commit ba7f314cdc)

Closes scylladb/scylladb#28850
2026-03-03 10:21:11 +01:00
Jenkins Promoter
0fdb0961a2 Update ScyllaDB version to: 2026.1.0-rc4 2026-03-02 20:36:38 +02:00
Roy Dahan
2100ae2d0a install.sh: fix REST API paths for nonroot installations
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.

Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)

Fixes: SCYLLADB-721

Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.

Closes scylladb/scylladb#28805

(cherry picked from commit 822c1597c9)

Closes scylladb/scylladb#28836
2026-03-01 23:20:12 +02:00
Jenkins Promoter
51fc498314 Update pgo profiles - aarch64 2026-03-01 05:01:46 +02:00
Jenkins Promoter
f4b938df09 Update pgo profiles - x86_64 2026-02-28 21:23:33 -05:00
Botond Dénes
0dfefc3f12 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080

(cherry picked from commit b637e17b19)

Closes scylladb/scylladb#28725
2026-02-27 06:32:15 +02:00
Łukasz Paszkowski
883e3e014a compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
    co_await cstate.compaction_done.wait([..] {
        return num_runs_for_compaction() <= threshold
            || !can_perform_regular_compaction(t);
    });
```
to a while loop with a predicated wait:
```
    while (can_perform_regular_compaction(t)
           && co_await num_runs_for_compaction() > threshold) {
        co_await cstate.compaction_done.wait([this, &t] {
            return !can_perform_regular_compaction(t);
        });
    }
```

This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).

However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.

This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.

Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610

Closes scylladb/scylladb#28801

(cherry picked from commit bb57b0f3b7)
2026-02-27 01:38:13 +02:00
Yaron Kaikov
4ccb795beb .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR

Closes scylladb/scylladb#28804

(cherry picked from commit b211590bc0)

Closes scylladb/scylladb#28812
2026-02-26 09:29:19 +02:00
Yaron Kaikov
9e02b0f45f ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs
refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv

- Remove -v and -i flags from curl to prevent credentials from being
  logged in workflow output
- Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting
  to prevent shell injection via crafted PR metadata
- Add org membership verification step for pull_request_target events so
  that only PRs from scylladb org members can trigger Jenkins CI

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796

Closes scylladb/scylladb#28785

(cherry picked from commit 98494e08eb)

Closes scylladb/scylladb#28811
2026-02-26 09:28:00 +02:00
Anna Stuchlik
eb9b8dbf62 doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781

(cherry picked from commit e2333a57ad)

Closes scylladb/scylladb#28795
2026-02-26 09:26:57 +02:00
Andrzej Jackowski
995df5dec6 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746

(cherry picked from commit cd4caed3d3)

Closes scylladb/scylladb#28794
2026-02-26 09:26:23 +02:00
Calle Wilund
beb781b829 gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588

(cherry picked from commit 8e71a6f52a)

Closes scylladb/scylladb#28724
2026-02-26 09:24:43 +02:00
Marcin Maliszkiewicz
502b7f296d Merge '[Backport 2026.1] vector_search: return NaN for similarity_cosine with all-zero vectors' from Scylladb[bot]
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

- (cherry picked from commit af0889d194)

- (cherry picked from commit 4e32502bb3)

Parent PR: #28609

Closes scylladb/scylladb#28775

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-25 14:34:58 +01:00
Nadav Har'El
b251ee02a4 test: add unit test for tablet_map::get_secondary_replica()
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.

This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit e463d528fe)
2026-02-25 12:59:26 +00:00
Nadav Har'El
f26d08dde2 test, alternator: add test for TTL expiration with a node down
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:

 1. Even though Alternator TTL splits the work of scanning and expiring
    items between nodes, all the items get correctly expired.
 2. When one node is down, all the items still expire because the
    "secondary" owner of each token range takes over expiring the
   items in this range while the "primary" owner is down.

This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0c7f499750)
2026-02-25 12:59:26 +00:00
Nadav Har'El
9cd1038c7a locator: fix get_secondary_replica() to match get_primary_replica()
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.

The next two patches will have two tests that fail before this patch,
and pass with it:

1. A unit test that checks that get_secondary_replica() returns a
   different node than get_primary_replica().

2. An Alternator TTL test that checks that when a node is down,
   expirations still happen because the secondary replica takes over
   the primary replica's work.

Fixes SCYLLADB-777

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9ab3d5b946)
2026-02-25 12:59:25 +00:00
Avi Kivity
fdae3e4f3a Merge '[Backport 2026.1] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot]
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

- (cherry picked from commit 289e910cec)

- (cherry picked from commit 6280cb91ca)

- (cherry picked from commit 960adbb439)

Parent PR: #28592

Closes scylladb/scylladb#28697

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-24 14:23:33 +02:00
Avi Kivity
d47e4898ea Merge '[Backport 2026.1] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot]
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

- (cherry picked from commit eefe66b2b2)

- (cherry picked from commit e08ac60161)

- (cherry picked from commit 1c764cf6ea)

- (cherry picked from commit e4c42acd8f)

- (cherry picked from commit 9ccc95808f)

Parent PR: #28521

Closes scylladb/scylladb#28780

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-24 14:21:50 +02:00
Aleksandra Martyniuk
7bc87de838 docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
(cherry picked from commit 9ccc95808f)
2026-02-24 09:26:27 +01:00
Aleksandra Martyniuk
2141b9b824 docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
(cherry picked from commit e4c42acd8f)
2026-02-24 09:26:13 +01:00
Aleksandra Martyniuk
aa50edbf17 docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
(cherry picked from commit 1c764cf6ea)
2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk
7f836aa3ec docs: describe upgrade to enforce_rack_list option
(cherry picked from commit e08ac60161)
2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk
bd26803c1a docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
(cherry picked from commit eefe66b2b2)
2026-02-23 20:58:23 +00:00
Dawid Pawlik
2feed49285 test/vector_search: add reproducer for rescoring with zero vectors
Add reproducer for the SCYLLADB-456 issue following exception
on ANN vector queries with rescoring with similarity cosine.

(cherry picked from commit 4e32502bb3)
2026-02-23 17:09:51 +00:00
Dawid Pawlik
3007cb6f37 vector_search: return NaN for similarity_cosine with all-zero vectors
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456
(cherry picked from commit af0889d194)
2026-02-23 17:09:50 +00:00
Ferenc Szili
1e2d1c7e85 load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703

(cherry picked from commit f1bc17bd4c)

Closes scylladb/scylladb#28729
2026-02-23 15:02:48 +01:00
Jenkins Promoter
55ad575c8f Update ScyllaDB version to: 2026.1.0-rc3 2026-02-22 14:48:35 +02:00
Tomasz Grabiec
8982140cd9 Merge '[Backport 2026.1] test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Scylladb[bot]
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

- (cherry picked from commit e14eca46af)

- (cherry picked from commit 2454de4f8f)

- (cherry picked from commit d33d38139f)

Parent PR: #28688

Closes scylladb/scylladb#28750

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-21 01:45:52 +01:00
Tomasz Grabiec
e90449f770 test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

(cherry picked from commit d33d38139f)
2026-02-20 16:35:39 +00:00
Tomasz Grabiec
98fd5c5e45 test: cluster: task_manager_client: Introduce wait_task_appears()
(cherry picked from commit 2454de4f8f)
2026-02-20 16:35:39 +00:00
Tomasz Grabiec
cca6a1c3dd tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.

(cherry picked from commit e14eca46af)
2026-02-20 16:35:39 +00:00
Patryk Jędrzejczak
9edd0ae3fb raft topology: prevent accessing nullptr returned by topology::find
It's better to raise an internal error than cause a segmentation fault on
possibly multiple nodes.

(cherry picked from commit 8e9c7397c5)
2026-02-20 13:00:21 +01:00
Patryk Jędrzejczak
dc3133b031 raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

(cherry picked from commit e21ecf69de)
2026-02-20 13:00:19 +01:00
Szymon Malewski
86554e6192 vector: Improve similarity functions performance
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471

Closes scylladb/scylladb#28615

(cherry picked from commit 668d6fe019)

Closes scylladb/scylladb#28690
2026-02-19 14:14:39 +02:00
Asias He
637618560b repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640

(cherry picked from commit 1be80c9e86)

Closes scylladb/scylladb#28714
2026-02-19 13:07:37 +02:00
Avi Kivity
8c3c5777da Merge '[Backport 2026.1] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot]
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

- (cherry picked from commit 0376d16ad3)

- (cherry picked from commit 3b98451776)

Parent PR: #28530

Closes scylladb/scylladb#28716

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-19 12:47:38 +02:00
Tomasz Grabiec
bb9a5261ec Merge '[Backport 2026.1] test: fix flaky test_balance_empty_tablets' from Scylladb[bot]
The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node.

After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets.

Fixes: SCYLLADB-603

The test is present in master and 2026.1, so we need to backport this.

- (cherry picked from commit 4ca40929ef)

Parent PR: #28598

Closes scylladb/scylladb#28638

* github.com:scylladb/scylladb:
  test/cluster: Remove short_tablet_stats_refresh_interval injection
  test: add read barrier to test_balance_empty_tablets
2026-02-18 23:39:03 +01:00
Marcin Maliszkiewicz
d5d81cc066 test: auth_cluster: add test for hanged AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s

(cherry picked from commit 3b98451776)
2026-02-18 19:43:02 +00:00
Marcin Maliszkiewicz
6a438543c2 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485

(cherry picked from commit 0376d16ad3)
2026-02-18 19:43:02 +00:00
Botond Dénes
99a67484bf Merge '[Backport 2026.1] cql3/statements/describe_statement: hide paxos state tables ' from Scylladb[bot]
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

- (cherry picked from commit f89a8c4ec4)

- (cherry picked from commit 9baaddb613)

Parent PR: #28230

Closes scylladb/scylladb#28508

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-18 12:41:08 +02:00
Anna Stuchlik
cabf2845d9 doc: fix the links on the repair-related pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28199.

This commit fixes the syntax of the internal links.

Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28487

(cherry picked from commit 77480c9d8f)

Closes scylladb/scylladb#28512
2026-02-18 12:39:07 +02:00
Yaron Kaikov
ff4a0fc87e ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543

(cherry picked from commit b30ecb72d5)

Closes scylladb/scylladb#28553
2026-02-18 12:38:10 +02:00
Botond Dénes
0a89dbb4d4 Merge '[Backport 2026.1] raft topology: generate notification about released nodes only once' from Scylladb[bot]
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

- (cherry picked from commit 10e9672852)

- (cherry picked from commit d28c841fa9)

- (cherry picked from commit 29da20744a)

Parent PR: #28367

Closes scylladb/scylladb#28612

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-18 12:37:33 +02:00
Aleksandra Martyniuk
19cbaa1be2 test: fix test_remove_node_violating_rf_rack_with_rack_list
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.

Add the 5th node to the cluster. This way the majority is always up.

Fixes: https://github.com/scylladb/scylladb/issues/28596.

Closes scylladb/scylladb#28610

(cherry picked from commit f955a90309)

Closes scylladb/scylladb#28639
2026-02-18 12:36:52 +02:00
Anna Mikhlin
9cf0f0998d .github/workflows: ignore quoted comments for trigger CI
prevent CI from being triggered when trigger-ci command appears inside
quoted (>) comment text

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604

(cherry picked from commit 33cf97d688)

Closes scylladb/scylladb#28652
2026-02-18 12:36:28 +02:00
Ernest Zaslavsky
f56e1760d7 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.

Closes scylladb/scylladb#28554

(cherry picked from commit 034c6fbd87)

Closes scylladb/scylladb#28668
2026-02-18 12:35:23 +02:00
Ernest Zaslavsky
db4e3a664d api: report restore params
report restore params once the API's call for restore is invoked

Closes scylladb/scylladb#28431

(cherry picked from commit 30699ed84b)

Closes scylladb/scylladb#28683
2026-02-18 12:34:33 +02:00
Botond Dénes
c292892d5f Merge '[Backport 2026.1] GCS object storage. Fix incompatibilty issues with "real" GCS' from Scylladb[bot]
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
 Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

- (cherry picked from commit a896d8d5e3)

- (cherry picked from commit 87aa6c8387)

Parent PR: #28400

Closes scylladb/scylladb#28685

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-18 12:33:28 +02:00
Calle Wilund
d87467f77b commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679

(cherry picked from commit ab4e4a8ac7)

Closes scylladb/scylladb#28693
2026-02-18 12:32:43 +02:00
Ernest Zaslavsky
6e92ee1bb2 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html

(cherry picked from commit 960adbb439)
2026-02-18 09:41:28 +00:00
Ernest Zaslavsky
4ecc402b79 s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.

(cherry picked from commit 6280cb91ca)
2026-02-18 09:41:27 +00:00
Ernest Zaslavsky
b27adefc16 s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.

(cherry picked from commit 289e910cec)
2026-02-18 09:41:27 +00:00
Calle Wilund
cad92d5100 utils/gcp/object_storage: URL-encode object names in URL:s
Fixes #28398

When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.

Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.

(cherry picked from commit 87aa6c8387)
2026-02-17 19:32:19 +00:00
Calle Wilund
44cc5ae30b utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.

Need to handle.

(cherry picked from commit a896d8d5e3)
2026-02-17 19:32:18 +00:00
Piotr Dulikowski
05a5bd542a Merge '[Backport 2026.1] vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Scylladb[bot]
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

- (cherry picked from commit aef5ff7491)

- (cherry picked from commit 079fe17e8b)

Parent PR: #28617

Closes scylladb/scylladb#28643

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-17 10:44:13 +01:00
Patryk Jędrzejczak
8a626bb458 Merge '[Backport 2026.1] test: explicitly set compression algorithm in test_autoretrain_dict' from Scylladb[bot]
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

- (cherry picked from commit e63cfc38b3)

- (cherry picked from commit 9ffa62a986)

Parent PR: #28625

Closes scylladb/scylladb#28667

* https://github.com/scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-17 10:19:26 +01:00
Patryk Jędrzejczak
3a56a0cf99 test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632

(cherry picked from commit 0693091aff)

Closes scylladb/scylladb#28673
2026-02-17 10:03:50 +01:00
Nikos Dragazis
0cdac69aab test/cluster: Remove short_tablet_stats_refresh_interval injection
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.

This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).

Use the config option. This reduces the execution time to ~8 seconds.

Fixes SCYLLADB-556.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28536

(cherry picked from commit 5d1e6243af)
2026-02-17 08:35:16 +01:00
Ferenc Szili
f04a3acf33 test: add read barrier to test_balance_empty_tablets
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.

After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.

Fixes: SCYLLADB-603

Closes scylladb/scylladb#28598

(cherry picked from commit 4ca40929ef)
2026-02-17 08:31:50 +01:00
Andrzej Jackowski
ad716f9341 test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
(cherry picked from commit 9ffa62a986)
2026-02-16 16:23:36 +00:00
Andrzej Jackowski
2edd87f2e1 test: remove unneeded semicolons from python test
(cherry picked from commit e63cfc38b3)
2026-02-16 16:23:36 +00:00
Piotr Dulikowski
ba10e74523 Merge '[Backport 2026.1] test: cluster: Fix test_sync_point' from Scylladb[bot]
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

- (cherry picked from commit 628e74f157)

- (cherry picked from commit ac4af5f461)

- (cherry picked from commit c5239edf2a)

- (cherry picked from commit a256ba7de0)

- (cherry picked from commit f83f911bae)

Parent PR: #28602

Closes scylladb/scylladb#28623

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-16 10:32:03 +01:00
Jenkins Promoter
5abc2fea9f Update pgo profiles - aarch64 2026-02-15 05:01:51 +02:00
Karol Nowacki
2ab81f768b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.

(cherry picked from commit 079fe17e8b)
2026-02-13 21:24:45 +00:00
Karol Nowacki
cfebb52db0 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
(cherry picked from commit aef5ff7491)
2026-02-13 21:24:45 +00:00
Dawid Mędrek
37ef37e8ab test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.

(cherry picked from commit f83f911bae)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
fdad814aa3 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

(cherry picked from commit a256ba7de0)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
0257f7cc89 test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.

(cherry picked from commit c5239edf2a)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
07bfd920e7 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.

(cherry picked from commit ac4af5f461)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
698ba5bd0b test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.

(cherry picked from commit 628e74f157)
2026-02-12 12:13:19 +00:00
Piotr Dulikowski
d8c7303d14 storage_service: fix indentation after previous patch
(cherry picked from commit 29da20744a)
2026-02-11 12:07:15 +00:00
Piotr Dulikowski
9365adb2fb raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.

Fixes: #28301
Refs: #25031
(cherry picked from commit d28c841fa9)
2026-02-11 12:07:14 +00:00
Piotr Dulikowski
6a55396e90 raft topology: extract "released" nodes calculation to external function
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.

(cherry picked from commit 10e9672852)
2026-02-11 12:07:14 +00:00
Pawel Pery
f4b79c1b1d Revert "Merge 'vector_search: add validator tests' from Pawel Pery"
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.

There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.

It needs a backport to 2026.1 to remove remaining validator tests from there.

Fixes: VECTOR-497

Closes scylladb/scylladb#28568

(cherry picked from commit 81d11a23ce)

Closes scylladb/scylladb#28577
2026-02-09 15:16:40 +02:00
Michał Hudobski
f633f57163 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519

(cherry picked from commit 6b9fcc6ca3)

Closes scylladb/scylladb#28538
2026-02-05 10:31:39 +01:00
Jenkins Promoter
09ed4178a6 Update ScyllaDB version to: 2026.1.0-rc2 2026-02-03 18:15:05 +02:00
Patryk Jędrzejczak
2bf7a0f65e Merge '[Backport 2026.1] storage_service: set up topology properly in maintenance mode' from Scylladb[bot]
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

- (cherry picked from commit a08c53ae4b)

- (cherry picked from commit 9d4a5ade08)

- (cherry picked from commit c92962ca45)

- (cherry picked from commit 408c6ea3ee)

- (cherry picked from commit 53f58b85b7)

- (cherry picked from commit 867a1ca346)

- (cherry picked from commit 6c547e1692)

- (cherry picked from commit 7e7b9977c5)

Parent PR: #28322

Closes scylladb/scylladb#28499

* https://github.com/scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-03 10:39:41 +01:00
Nadav Har'El
5b15c52f1e test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9baaddb613)
2026-02-02 23:31:58 +00:00
Michał Jadwiszczak
26e17202f6 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes scylladb/scylladb#28183

(cherry picked from commit f89a8c4ec4)
2026-02-02 23:31:58 +00:00
Patryk Jędrzejczak
b62e1b405b test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.

(cherry picked from commit 7e7b9977c5)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
f3d2a16e66 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.

(cherry picked from commit 6c547e1692)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
eee99ebb3d test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.

(cherry picked from commit 867a1ca346)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
c248744c5a test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.

(cherry picked from commit 53f58b85b7)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
4ba3c08d45 test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.

(cherry picked from commit 408c6ea3ee)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
c8c21cc29c test: test_maintenance_mode: remove the redundant value from the query result
(cherry picked from commit c92962ca45)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
e95689c96b storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.

(cherry picked from commit 9d4a5ade08)
2026-02-02 17:02:15 +00:00
Patryk Jędrzejczak
6094f4b7b2 storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988

(cherry picked from commit a08c53ae4b)
2026-02-02 17:02:15 +00:00
Avi Kivity
ad64dc7c01 Merge '[Backport 2026.1] load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Scylladb[bot]
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

- (cherry picked from commit 71be10b8d6)

- (cherry picked from commit 92dbde54a5)

Parent PR: #28440

Closes scylladb/scylladb#28471

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-02-01 13:51:31 +02:00
Jenkins Promoter
bafd185087 Update pgo profiles - aarch64 2026-02-01 05:06:48 +02:00
Jenkins Promoter
07d1f8f48a Update pgo profiles - x86_64 2026-02-01 04:20:45 +02:00
Ferenc Szili
523d529d27 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.

(cherry picked from commit 92dbde54a5)
2026-02-01 00:34:26 +00:00
Ferenc Szili
c8dbd43ed5 load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.

(cherry picked from commit 71be10b8d6)
2026-02-01 00:34:26 +00:00
Botond Dénes
0cf9f41649 Merge '[Backport 2026.1] docs: add documentation for automatic repair' from Scylladb[bot]
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.

Fixes: SCYLLADB-130

This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.

- (cherry picked from commit a84b1b8b78)

- (cherry picked from commit 57b2cd2c16)

- (cherry picked from commit 1713d75c0d)

Parent PR: #28199

Closes scylladb/scylladb#28424

* github.com:scylladb/scylladb:
  docs: add feature page for automatic repair
  docs: inter-link incremental-repair and repair documents
  docs: incremental-repair: fix curl example
2026-01-30 16:01:03 +02:00
Botond Dénes
dc89e2ea37 Merge '[Backport 2026.1] test: test_alternator_proxy_protocol: fix race between node startup and test start' from Scylladb[bot]
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.

Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.

Fixes #28210
Fixes #28211

Flaky tests are only in master and branch-2026.1, so backporting there.

- (cherry picked from commit ebac810c4e)

- (cherry picked from commit 59f2a3ce72)

Parent PR: #28291

Closes scylladb/scylladb#28443

* github.com:scylladb/scylladb:
  test: test_alternator_proxy_protocol: wait for the node to report itself as serving
  test: cluster_manager: add ability to wait for supervisor STATUS=serving
2026-01-30 15:59:09 +02:00
Tomasz Grabiec
797f56cb45 Merge '[Backport 2026.1] Improve load balancer logging and other minor cleanups' from Scylladb[bot]
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

- (cherry picked from commit 32b336e062)

- (cherry picked from commit df32318f66)

- (cherry picked from commit f2b0146f0f)

- (cherry picked from commit 0d090aa47b)

- (cherry picked from commit 12fdd205d6)

- (cherry picked from commit 615b86e88b)

- (cherry picked from commit 7228bd1502)

- (cherry picked from commit 4a161bff2d)

- (cherry picked from commit ef0e9ad34a)

- (cherry picked from commit 9715965d0c)

- (cherry picked from commit 8e831a7b6d)

Parent PR: #28337

Closes scylladb/scylladb#28428

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-30 14:08:34 +01:00
Pawel Pery
be1d418bc0 vector_search: allow full secondary indexes syntax while creating the vector index
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.

This patch allows creating vector indexes using this CQL syntax:

```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  discussion_board_id int,
  country text,
  lang text,
  PRIMARY KEY ((commenter, discussion_board_id), created_at)
);

CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
  ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
  ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Currently, if we run these queries to create indexes we will receive such errors:

```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```

This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.

Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).

This commits adds cqlpy test to check errors while creating indexes.

Fixes: SCYLLADB-298

This needs to be backported to version 2026.1 as this is a fix for filtering support.

Closes scylladb/scylladb#28366

(cherry picked from commit f49c9e896a)

Closes scylladb/scylladb#28448
2026-01-30 11:25:01 +01:00
Patryk Jędrzejczak
46923f7358 Merge '[Backport 2026.1] Introduce TTL and retries to address resolution' from Scylladb[bot]
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

- (cherry picked from commit bd9d5ad75b)

- (cherry picked from commit 359d0b7a3e)

- (cherry picked from commit ce0c7b5896)

- (cherry picked from commit 5b3e513cba)

- (cherry picked from commit 66a33619da)

- (cherry picked from commit 6eb7dba352)

- (cherry picked from commit a05a4593a6)

- (cherry picked from commit 3a31380b2c)

- (cherry picked from commit 912c48a806)

Parent PR: #27891

Closes scylladb/scylladb#28405

* https://github.com/scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-30 11:10:48 +01:00
Avi Kivity
4032e95715 test: test_alternator_proxy_protocol: wait for the node to report itself as serving
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.

(cherry picked from commit 59f2a3ce72)
2026-01-29 22:46:11 +02:00
Avi Kivity
eab10c00b1 test: cluster_manager: add ability to wait for supervisor STATUS=serving
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.

(cherry picked from commit ebac810c4e)
2026-01-29 19:48:53 +00:00
Patryk Jędrzejczak
091c3b4e22 test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388

(cherry picked from commit a2c1569e04)

Closes scylladb/scylladb#28417
2026-01-29 11:25:10 +01:00
Tomasz Grabiec
19eadafdef tablets: tablet_allocator.cc: Convert tabs to spaces
(cherry picked from commit 8e831a7b6d)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
358fc15893 tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.

(cherry picked from commit 9715965d0c)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
32124d209e tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0

(cherry picked from commit ef0e9ad34a)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
c7f4bda459 tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.

(cherry picked from commit 4a161bff2d)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
568af3cd8d tablets: load_balancer: Extract print_node_stats()
(cherry picked from commit 7228bd1502)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
bd694dd1a1 tablet: load_balancer: Use empty() instead of size() where applicable
(cherry picked from commit 615b86e88b)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
9672e0171f tablets: Fix redundancy in migration_plan::empty()
(cherry picked from commit 12fdd205d6)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
8cec41acf2 tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.

(cherry picked from commit 0d090aa47b)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
d207de0d76 tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.

(cherry picked from commit f2b0146f0f)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
edde4e878e tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.

(cherry picked from commit df32318f66)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
be1c674f1a tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.

(cherry picked from commit 32b336e062)
2026-01-29 09:06:49 +00:00
Botond Dénes
a7cff37024 docs: add feature page for automatic repair
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.

(cherry picked from commit 1713d75c0d)
2026-01-29 00:25:21 +00:00
Botond Dénes
9431bc5628 docs: inter-link incremental-repair and repair documents
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.

(cherry picked from commit 57b2cd2c16)
2026-01-29 00:25:21 +00:00
Botond Dénes
14db8375ac docs: incremental-repair: fix curl example
Currently it is regular text, make it a code block so it is easier to
read and copy+paste.

(cherry picked from commit a84b1b8b78)
2026-01-29 00:25:21 +00:00
Ernest Zaslavsky
614020b5d5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240

(cherry picked from commit cb2aa85cf5)

Closes scylladb/scylladb#28345
2026-01-28 14:58:28 +02:00
Anna Stuchlik
e091afb400 doc: add the version name to the Install pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28022
It adds the version name to more install pages.

Closes scylladb/scylladb#28289

(cherry picked from commit c25b770342)

Closes scylladb/scylladb#28362
2026-01-28 12:52:23 +02:00
Ernest Zaslavsky
edc46fe6a1 connection_factory: includes cleanup
(cherry picked from commit 912c48a806)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
f8b9b767c2 dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.

(cherry picked from commit 3a31380b2c)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
23d038b385 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.

(cherry picked from commit a05a4593a6)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
3e2d1384bf connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.

(cherry picked from commit 6eb7dba352)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
bd7481e30c connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`

(cherry picked from commit 66a33619da)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
16d7b65754 connection_factory: extract connection logic into a member
extract connection logic into a private member function to make it reusable

(cherry picked from commit 5b3e513cba)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
e30c01eae6 connection_factory: remove unnecessary else
(cherry picked from commit ce0c7b5896)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
d0f3725887 connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.

(cherry picked from commit 359d0b7a3e)
2026-01-27 22:43:07 +00:00
Ernest Zaslavsky
c12168b7ef s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call

(cherry picked from commit bd9d5ad75b)
2026-01-27 22:43:07 +00:00
Yaron Kaikov
76c0162060 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374

(cherry picked from commit 3f10f44232)

Closes scylladb/scylladb#28394
2026-01-27 16:05:06 +02:00
Avi Kivity
c9620d9573 test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336

(cherry picked from commit ec70cea2a1)

Closes scylladb/scylladb#28365
2026-01-27 12:20:18 +02:00
Anna Stuchlik
91cf77d016 doc: update the GPG keys
Update the keys in the installation instructions (linux packages).

Fixes https://github.com/scylladb/scylladb/issues/28330

Closes scylladb/scylladb#28357

(cherry picked from commit edc291961b)

Closes scylladb/scylladb#28370
2026-01-27 11:20:11 +02:00
Anna Stuchlik
2c2f0693ab doc: remove the troubleshooting section on upgrades from OSS
This commit removes a document originally created to troubleshoot
upgrades from Open Source to Enterprise.

Since we no longer support Open Source, this document is now redundant.

Fixes https://github.com/scylladb/scylladb/issues/28246

Closes scylladb/scylladb#28248

(cherry picked from commit 84281f900f)

Closes scylladb/scylladb#28360
2026-01-26 19:17:28 +02:00
Jenkins Promoter
2c73d0e6b5 Update ScyllaDB version to: 2026.1.0-rc1 2026-01-26 11:06:45 +02:00
Yaron Kaikov
f94296e0ae Update ScyllaDB version to: 2026.1.0-rc0 2026-01-25 11:07:17 +02:00
Botond Dénes
edda66886e Merge 'Replace sstable_stream_source_impl's internal sink and source with seastar utilities' from Pavel Emelyanov
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.

Using newer seastar facilities, not backporting.

Closes scylladb/scylladb#28321

* github.com:scylladb/scylladb:
  sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
  sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
2026-01-23 16:55:18 +02:00
Pavel Emelyanov
3e09d3cc97 test: Keep test_gossiper_live_endpoints checks togethger
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28224
2026-01-23 16:53:48 +02:00
Piotr Dulikowski
3ec4f67407 Merge 'vector_index: Implement rescoring' from Szymon Malewski
This series implements rescoring algorithm.

Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.

When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.

Example:

Creating a Vector Index with Rescoring

```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
    id int PRIMARY KEY,
    embedding vector<float, 128>
);

-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
    USING 'vector_index'
    WITH OPTIONS = {
        'similarity_function': 'cosine',
        'quantization': 'i8',
        'oversampling': '2.0',
        'rescoring': 'true'
    };
```

1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results

Query example:

```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```

With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results

In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.

Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - doesn't need backport.

Closes scylladb/scylladb#27769

* github.com:scylladb/scylladb:
  vector_index: rescoring: Fetch oversampled rows
  vector_index: rescoring: Sort by similarity column
  select_statement: Modify `needs_post_query_ordering` condition
  vector_index: rescoring: Add hidden similarity score column
  vector_index: Refactor extracting ANN query information
2026-01-23 15:20:10 +01:00
Patryk Jędrzejczak
a41d7a9240 Merge 'test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Petr Gusev
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.

Fixes scylladb/scylladb#28260

backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version

Closes scylladb/scylladb#28315

* https://github.com/scylladb/scylladb:
  storage_proxy: drop stop() method
  test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
2026-01-23 15:18:17 +01:00
Avi Kivity
30d6f3b8e0 test: test_proxy_protocol: bump timeout
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.

The test never expects a timeout, so increasing it won't increase
the test duration.

Fixes #28028

Closes scylladb/scylladb#28272
2026-01-23 15:37:00 +02:00
Łukasz Paszkowski
09fde82a33 test/scylla_gdb: fix coro_task request usage, rename duplicate test
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
  and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
  to `test_sstable_index_cache` so both tests are collected

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#28286
2026-01-23 15:25:58 +02:00
Pavel Emelyanov
4a307d931a sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:22:22 +03:00
Pavel Emelyanov
97b1340a68 sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:21:14 +03:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Petr Gusev
c45244b235 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.
2026-01-23 11:22:03 +01:00
Petr Gusev
f5ed3e9fea test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260
2026-01-23 11:20:36 +01:00
Botond Dénes
35c9a00275 Merge 'test.py: pass correctly extra cmd line arguments' from Andrei Chekun
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.

No backport needed, only framework enhancement.

Closes scylladb/scylladb#28156

* github.com:scylladb/scylladb:
  test.py: do not crash when there is no boost log
  test.py: pass correctly extra cmd line arguments
2026-01-23 11:26:01 +02:00
Andrzej Jackowski
c493a66668 test: check cql_requests_count instead of tasks_processed in SL
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).

To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.

Fixes: scylladb/scylladb#27715

Closes scylladb/scylladb#28318
2026-01-23 10:19:29 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Anna Stuchlik
c681b3363d doc: remove an outdated KB
This page was created for very outdated versions of ScyllaDB (5.1 (ScyllaDB Open Source) and 2022.2 (ScyllaDB Enterprise) to ensure smooth upgrade to that versions.

We no longer need that document.

Fixes https://github.com/scylladb/scylladb/issues/28264

Closes scylladb/scylladb#28266
2026-01-22 18:27:14 +01:00
Szymon Wasik
927aebef37 Add vector search documentation links to CQL docs
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.

Fixes: SCYLLADB-371

Closes scylladb/scylladb#28312
2026-01-22 16:46:38 +01:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Michael Litvak
1f7a65904e alternator: don't require rf_rack flag for indexes, validate instead
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
2026-01-22 16:11:35 +01:00
Szymon Malewski
29d090845a vector_index: rescoring: Fetch oversampled rows
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.

With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-22 15:38:44 +01:00
Szymon Malewski
0bc95bcf87 vector_index: rescoring: Sort by similarity column
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.

However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
2026-01-22 15:38:44 +01:00
Szymon Malewski
57e7a4fa4f select_statement: Modify needs_post_query_ordering condition
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.

In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.

In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.

Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
2026-01-22 15:38:44 +01:00
Szymon Malewski
c89957b725 vector_index: rescoring: Add hidden similarity score column
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.

In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
2026-01-22 15:38:40 +01:00
Anna Stuchlik
0aa881f190 doc: add the info about Alternator ports to the Admin Guide
Fixes https://github.com/scylladb/scylladb/issues/23706

Closes scylladb/scylladb#27724
2026-01-22 16:10:58 +03:00
Israel Fruchter
6b3ce5fdcc dist: scylla_coredump_setup: force unmount /var/lib/systemd/coredump before setup
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.

Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.

Fix: scylladb/scylla-enterprise#5692

Closes scylladb/scylladb#28300
2026-01-22 14:35:26 +02:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Patryk Jędrzejczak
e11450ccca test: test_raft_recovery_user_data: prevent repeated ALTER KEYSPACE request
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.

This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.

We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.

The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.

Fixes #27862

No backport needed as 2e7ba1f8ce is only
in master.

Closes scylladb/scylladb#28290
2026-01-22 14:13:42 +03:00
Szymon Malewski
e0cc6ca7e6 vector_index: Refactor extracting ANN query information
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
2026-01-22 10:00:47 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
d6557764b9 snapshot: keep per-sstable metadata in manifest.json
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.

Fixes: SCYLLADB-196

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:42:52 +02:00
Benny Halevy
dc9093303d snapshot: add table info and tablet_count to manifest.json
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.

For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables.  In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.

Fixes SCYLLADB-195

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:36:52 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
5e90fbb9d2 table: snapshot_on_all_shards: take snapshot_options
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
d9fc3b1c11 snapshot: add snapshot name in manifest.json
Store the snapshot tag in the manifest file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0f014e2916 test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0d82e56078 snapshot: add node info to manifest.json
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
24040efc54 snapshot: add manifest info to manifest.json
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
9e0f5410ae test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Patryk Jędrzejczak
e503340efc test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.

Fixes #28227

Closes scylladb/scylladb#28233
2026-01-22 06:55:16 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Nadav Har'El
5c1e525618 Merge 'vector_search: cache restrictions JSON at prepare time ' from Dawid Pawlik
Add `prepared_filter` class which handles the preparation, construction, and caching of Vector Search filtering compatible JSON object.

If no bind markers found in primary key restrictions, the JSON object will be built once at prepare time and cached for use during execution calls.

Additionally, this patch moves the filter functions from `cql3::restrictions` to `vector_search` namespace and does some renaming to make the purpose of those functions clear.

Follow-up: https://github.com/scylladb/scylladb/pull/28109
Fixes: [SCYLLADB-299](https://scylladb.atlassian.net/browse/SCYLLADB-299)

[SCYLLADB-299]: https://scylladb.atlassian.net/browse/SCYLLADB-299?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28276

* github.com:scylladb/scylladb:
  test/vector_search: add filter tests with bind variables
  vector_search: cache restrictions JSON at prepare time
  refactor: vector_search: move filter logic to vector_search namespace
2026-01-21 19:03:35 +02:00
Patryk Jędrzejczak
1f0f694c9e test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274
2026-01-21 15:17:42 +01:00
Petr Gusev
843da2bbb8 test_encryption: capture stderr
The check_output function doesn't capture stderr by default, making
it impossible to debug if something goes wrong.
2026-01-21 14:56:01 +01:00
Petr Gusev
925d86fefc test/cluster: add test_strong_consistency.py
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.

Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
2026-01-21 14:56:01 +01:00
Petr Gusev
59a876cebb raft_group_registry: disable metrics for non-0 groups
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.

This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
2026-01-21 14:56:01 +01:00
Petr Gusev
1f170d2566 strong consistency: implement select_statement::do_execute() 2026-01-21 14:56:01 +01:00
Petr Gusev
a5d611866e cql: add select_statement.cc 2026-01-21 14:56:01 +01:00
Petr Gusev
493ebe2add strong consistency: implement coordinator::query() 2026-01-21 14:56:01 +01:00
Petr Gusev
ccf90cfde8 cql: add modification_statement
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
2026-01-21 14:56:01 +01:00
Petr Gusev
989566e8a3 cql: add statement_helpers
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.

redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.

is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
2026-01-21 14:56:01 +01:00
Petr Gusev
b94fd11939 strong consistency: implement coordinator::mutate()
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().

Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.

The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.

We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
2026-01-21 14:56:01 +01:00
Petr Gusev
801bf82a34 raft.hh: make server::wait_for_leader() public
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.

We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
2026-01-21 14:56:01 +01:00
Petr Gusev
7d111f2396 strong_consistency: add coordinator
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
2026-01-21 14:56:01 +01:00
Petr Gusev
4413142f25 modification_statement: make get_timeout public
We'll need to access this method in a new
strong_consistency/modification_statement class.
2026-01-21 14:56:00 +01:00
Petr Gusev
4902186ede strong_consistency: add groups_manager
This class is reponsible for managing raft groups for
strongly-consistent tablets.
2026-01-21 14:56:00 +01:00
Petr Gusev
4eee5bc273 strong_consistency: add state_machine and raft_command
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.

We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
2026-01-21 14:56:00 +01:00
Petr Gusev
a8350b274e table: add get_max_timestamp_for_tablet
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.

This commit adds a function that returns the maximum timestamp for a
given tablet.

Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
  timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
  the maximum of the contributing sstables.
2026-01-21 14:56:00 +01:00
Petr Gusev
dc4896b69b tablets: generate raft group_id-s for new table
We assign shard to zero because raft is currently only supported on the
zero shard.
2026-01-21 14:56:00 +01:00
Petr Gusev
6dd74be684 tablet_replication_strategy: add consistency field
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
2026-01-21 14:56:00 +01:00
Petr Gusev
53f93eb830 tablets: add raft_group_id
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.

This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.

The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
2026-01-21 14:56:00 +01:00
Petr Gusev
cab3e1eea5 modification_statement: remove virtual where it's not needed
This is a refactoring/simplification commit.
2026-01-21 14:56:00 +01:00
Petr Gusev
9015bed794 modification_statement: inline prepare_statement()
This is a refactoring/simplification commit.

There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
2026-01-21 14:56:00 +01:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Petr Gusev
6b0d757f28 cql: rename strongly_consistent statements to broadcast statements
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.

The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
2026-01-21 14:56:00 +01:00
Pavel Emelyanov
18b5a49b0c Populate all sl:* groups into dedicated top-level supergroup
This patch changes the layout of user-facing scheduling groups from

/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)

into

/
`- user (supergroup)
   `- statement
   `- sl:default
   `- sl:*
`- other groups (compaction, streaming, etc.)

The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).

The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.

The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.

Unit tests keep their user groups under root supergroup (for simplicity)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28235
2026-01-21 14:14:48 +02:00
Botond Dénes
cf4133c62d Merge 'service: pass topology guard to RBNO' from Aleksandra Martyniuk
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.

Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759

No topology_guard in any version; needs backport to all versions

Closes scylladb/scylladb#27839

* github.com:scylladb/scylladb:
  service: use session variable for streaming
  service: pass topology guard to RBNO
2026-01-21 12:47:28 +02:00
Robert Bindar
ea8a661119 reduce test_backup.py and test_refresh.py datasets
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).

This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28280
2026-01-21 10:47:36 +02:00
Nadav Har'El
8962093d90 Merge 'vector_index: Introduce rescoring index option' from Szymon Malewski
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.

Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294

No backporting - it is a new feature.

Closes scylladb/scylladb#28165

* github.com:scylladb/scylladb:
  vector_search: Add more rescoring validation tests
  vector_search: Add rescoring validation test
  vector_search: doc: Document new index option
  vector_search: test: Add `rescoring` index option test
  vector_index: introduce rescoring option
  vector_index: improve options validation
2026-01-21 10:46:22 +02:00
Avi Kivity
ecb6fb00f0 streamed_mutation_freezer: use chunked_vector instead of std::deque for clustering rows
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.

Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.

Fixes #28275

Closes scylladb/scylladb#28281
2026-01-21 10:13:44 +02:00
Botond Dénes
09d3b6c98b Merge 'auth: use paged internal queries during migration' from Piotr Smaron
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: https://github.com/scylladb/scylladb/issues/27577

A minor enhancement, no need to backport.

Closes scylladb/scylladb#25395

* github.com:scylladb/scylladb:
  auth: use paged internal queries during migration
  auth: move some code in migrate_to_auth_v2 up
  auth: re-align pieces of migrate_to_auth_v2
  cql: extend `query_internal` with `query_state` param
2026-01-21 09:58:06 +02:00
Raphael S. Carvalho
d16f9c821d Revert "api: storage_service/tablets/repair: disable incremental repair by default"
This reverts commit c8cff94a5a.

Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28120
2026-01-21 08:50:13 +02:00
Dawid Pawlik
4a7e20953a test/cqlpy: remove the xfail mark from already passing tests
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.

Closes scylladb/scylladb#28231
2026-01-21 08:47:20 +02:00
Botond Dénes
7a3f51b304 Update seastar submodule
* seastar e00f1513...f55dc7eb (8):
  > build: detect io_uring correctly
  > ioinfo: report actual I/O latency goal instead of configured value
  > core/loop: repeater,repeat_until_value: don't rethrow exceptions
  > seastar::test: Add exception interception to extract nested info
  > bool_class: add operator<=>
  > Merge 'Perftune.py: generic virtual device with lower-level slave network devices support' from Vladislav Zolotarov
    scripts/perftune.py: NetPerfTuner: rename "hardware interface" methods to "tunable interface" methods
    scripts/perftune.py: NetPerfTuner: refactor NIC handling to support any virtual interface with slaves
    scripts/perftune.py: NetPerfTuner: rename methods to private for better encapsulation
  > scripts/perftune.py: add fast path queues IRQ filtering and indexing for Microsoft Azure Network Adapter (MANA) NICs
  > file: Finally override dma_read_bulk() in posix_file_impl

Closes scylladb/scylladb#28262
2026-01-21 08:44:20 +02:00
Szymon Malewski
d6226500f6 vector_search: Add more rescoring validation tests
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.

As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
2026-01-20 21:01:45 +01:00
Karol Nowacki
376c70be75 vector_search: Add rescoring validation test
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.

First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
2026-01-20 21:01:45 +01:00
Karol Nowacki
41dcf12463 vector_search: doc: Document new index option
Adds documentation for the rescoring option for vector search indexes.
2026-01-20 21:01:45 +01:00
Karol Nowacki
b268eda67e vector_search: test: Add rescoring index option test
Add tests to validate `rescoring` index options.
It also improves tests for related `oversampling` option validation.
2026-01-20 21:01:45 +01:00
Szymon Malewski
c5945b1ef4 vector_index: introduce rescoring option
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
2026-01-20 21:01:45 +01:00
Szymon Malewski
262a8cef0b vector_index: improve options validation
In this patch we enhance validation of option by:
- giving context (option name) in error messages
- listing supported values in error messages of enumerated options
- avoiding using templates

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Follow-up: https://github.com/scylladb/scylladb/pull/27677
2026-01-20 21:01:41 +01:00
Dawid Pawlik
f27ef79d0d test/vector_search: add filter tests with bind variables
Add tests that check if preparation of the filter  does work
and not produce cache when the restrictions consist of bind variables.
2026-01-20 18:17:46 +01:00
Dawid Pawlik
e62cb29b7d vector_search: cache restrictions JSON at prepare time
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.

If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.

Adjust tests accordingly to use prepared filters.

Follow-up: #28109
Fixes: SCYLLADB-299
2026-01-20 17:15:52 +01:00
Andrei Chekun
91e2b027ce test.py: do not crash when there is no boost log
Small fix to not crash the whole process when boost tests failed to
start and do not produced the log file that can be parsed.
2026-01-20 15:52:40 +01:00
Andrei Chekun
58d3052ad4 test.py: pass correctly extra cmd line arguments
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
2026-01-20 15:52:40 +01:00
Anna Stuchlik
84e9b94503 doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725
2026-01-20 15:41:03 +01:00
Dawid Pawlik
f54a4010c0 refactor: vector_search: move filter logic to vector_search namespace
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.

Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.

Follow-up: #28109
2026-01-20 13:13:43 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Avi Kivity
c7dda5500c database: simplify apply_counter_update exception handling
Use coroutine::try_future to exit the coroutine immediately on
error instead of explict checks.

Closes scylladb/scylladb#28257
2026-01-20 11:13:49 +02:00
Wojciech Mitros
fc2aecea69 idl: don't redefine bound_weight and partition_region in paging_state.idl.hh
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.

Closes scylladb/scylladb#28228
2026-01-20 11:12:47 +02:00
Aleksandra Martyniuk
2be5ee9f9d service: use session variable for streaming
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
3fe596d556 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
f0dbf6135d test: add test for enforce_rack_list option 2026-01-20 10:01:15 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Michael Litvak
e7ec87382e Revert "alternator: require rf_rack_valid_keyspaces when creating index"
This reverts commit 4b26a86cb0.

The rf_rack_valid_keyspaces option is now not required for creating MVs.
2026-01-20 09:56:48 +01:00
Pavel Emelyanov
8ecd4d73ac test: Update cluster/object_store/ tests to use new S3 config format
Currently the suite generates config in old format, and only a single
test validates that using new format "works".

This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28113
2026-01-20 10:53:34 +02:00
Pavel Emelyanov
5e369c0439 audit: Stop using deprecated seastar UDP sending API
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28198
2026-01-20 10:51:23 +02:00
Avi Kivity
874322f95e multishard_query: simplify do_query() coroutine/continuation complexity
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.

While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.

Closes scylladb/scylladb#28258
2026-01-20 10:45:10 +02:00
Łukasz Paszkowski
e07fe2536e test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
   node A is missing data, digest mismatches and data reconciliation
   is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
   failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140
2026-01-20 10:38:20 +02:00
Piotr Smaron
6084f250ae auth: use paged internal queries during migration
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: scylladb/scylladb#27577
2026-01-20 09:32:21 +01:00
Piotr Smaron
345458c2d8 auth: move some code in migrate_to_auth_v2 up
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
2026-01-20 09:15:53 +01:00
Piotr Smaron
97fe3f2a2c auth: re-align pieces of migrate_to_auth_v2
This is needed to make next commits more readable.
Just moved the touched code 4 spaces to the right.
2026-01-20 09:15:53 +01:00
Piotr Smaron
3ca6b59f80 cql: extend query_internal with query_state param
This later is going to be used to pass a query timeout via `qs` to
`query_internal`.
2026-01-20 09:15:48 +01:00
Nadav Har'El
70b3cd0540 Merge 'vector_index: introduce quantization and oversampling options' from Szymon Malewski
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - no backporting

Closes scylladb/scylladb#27677

* github.com:scylladb/scylladb:
  vector_search: doc: Document new index options
  vector_search: test: Test oversampling
  vector_search: test: Add rescoring index options test
  vector_search: test: Extract Configure utility to shared header
  vector_index: introduce `quantization` and `oversampling` options
2026-01-20 08:50:46 +02:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Nikos Dragazis
4cde34f6f2 storage_service: Remove redundant yields
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-19 16:18:49 +02:00
Tomasz Grabiec
7977c97694 Merge 'db: add effective_capacity to load_per_node virtual table' from Ferenc Szili
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.

Size based load balancing is currently only on master, so no backport is needed.

Closes scylladb/scylladb#28220

* github.com:scylladb/scylladb:
  docs: add effective_capacity to system keyspace docs
  virtual_table: add effective_capacity to load_per_node
2026-01-19 13:17:28 +01:00
Marcin Maliszkiewicz
be8a30230b Merge 'test/cluster/dtest: import scrub_test.py' from Botond Dénes
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.

dtest import, no backport needed

Closes scylladb/scylladb#28085

* github.com:scylladb/scylladb:
  test/cluster/dtest: remove is_win() and users
  test/cluster/dtest/scrub_test.py: add license blurb
  test/cluster/dtest: import scrub_test.py
  test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
  test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
2026-01-19 12:14:08 +01:00
Botond Dénes
2e4d0e42f0 test/cluster/dtest: remove is_win() and users
ScyllaDB and its tests never run on windows, this function is not
needed, patch it out.
2026-01-19 12:56:57 +02:00
Botond Dénes
8953a143e5 test/cluster/dtest/scrub_test.py: add license blurb
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
2026-01-19 12:55:59 +02:00
Botond Dénes
d2c266eb47 test/cluster/dtest: import scrub_test.py
Import the test verbatim. Requires adding is_win() to ccmlib/common.py,
with a dummy implementation.
2026-01-19 12:52:44 +02:00
Botond Dénes
99e8a92aef test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
To work in the local test.py context.
2026-01-19 12:52:44 +02:00
Botond Dénes
807da53583 test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
2026-01-19 12:52:44 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Karol Nowacki
324b829263 vector_search: doc: Document new index options
Adds documentation for the `quantization` and `oversampling`
options for vector search indexes.
2026-01-19 10:28:46 +01:00
Karol Nowacki
bca17290f4 vector_search: test: Test oversampling
Add test to verify that Scylla correctly oversamples the limit
according to the oversampling option.
2026-01-19 10:28:46 +01:00
Karol Nowacki
e347f6d0d4 vector_search: test: Add rescoring index options test
Add tests to validate quantization and oversampling index options.
2026-01-19 10:28:44 +01:00
Karol Nowacki
24b037e8e3 vector_search: test: Extract Configure utility to shared header
Move Configure test utility to dedicated file for reuse across test suites.
2026-01-19 10:21:44 +01:00
Szymon Malewski
b8e91ee6ae vector_index: introduce quantization and oversampling options
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-19 10:21:43 +01:00
Tomasz Grabiec
dd0fc35c63 lsa: Export metrics for reclaim/evict/compact time
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.

Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.

reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.

evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.

Closes scylladb/scylladb#27774
2026-01-19 12:08:16 +03:00
Nadav Har'El
3e270a49f7 test/cqlpy: remove test_describe.py from cluster reuse blacklist
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.

I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.

We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
   CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
   if "USE" was in effect, it will return the correct results.

I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.

Fixes #26291

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27973
2026-01-19 12:02:00 +03:00
Dario Mirovic
823d1b9c03 audit: fix start_audit init sequence placement
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.

Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.

Fixes SCYLLADB-252

Closes scylladb/scylladb#28139
2026-01-19 11:57:39 +03:00
Botond Dénes
6f5f42305a docs: make the glossary more tablet inclusive
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.

One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.

Closes scylladb/scylladb#28170
2026-01-19 11:50:13 +03:00
Ernest Zaslavsky
829bd9b598 aws_error: fix nested exception handling
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary

Closes scylladb/scylladb#28143
2026-01-19 11:41:47 +03:00
Botond Dénes
b7bc48e7b7 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155
2026-01-19 11:37:51 +03:00
Nadav Har'El
d86d5b33aa test/cqlpy: translate Cassandra's unit tests for LWT
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.

This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.

Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.

These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.

However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.

Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.

Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:

1. On a successful conditional write, Cassandra returns just true, Scylla
   also returns the old contents of the row.

2. Similarly, in an IF EXISTS write that failed (the row did not exist),
   Cassandra returns just false, Scylla also returns extra null values for
   each and every column of the row.

3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
   UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.

4. When there are static columns, Scylla's LWT response returns the static
   column first, Cassandra returns the modified column first. Since both
   also say which columns they return, neither is more correct than the other,
   a normally users will address specific columns by name, not by position.

5. docs/kb/lwt-differences.rst explains that "the returned result set
   contains an old row for every conditional statement in the batch".
   Beyond this different, actually non-conditional updates in the batch will
   also get a row in Scylla's result. Refs #27955.

6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
   and other conditions for the same row. Cassandra doesn't, so checks that
   these combinations are not allowed were commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27961
2026-01-19 09:46:04 +02:00
Botond Dénes
19efd7f6f9 Merge 'The system_replicated_keys should be mark as a system keyspace' from Amnon Heiman
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.

A side effect of that is that metrics that are not supposed to be reported are.
Fixes #27903

Closes scylladb/scylladb#27954

* github.com:scylladb/scylladb:
  distributed_loader: system_replicated_keys as system keyspace
  replicated_key_provider: make KSNAME public
2026-01-19 09:37:41 +02:00
Aleksandra Martyniuk
65cba0c3e7 service: node_ops: remove coroutine::lambda wrappers
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.

The lambda captures may be destroyed before the lambda is called, leading
to use after free.

Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.

Fixes: https://github.com/scylladb/scylladb/issues/28200.

Closes scylladb/scylladb#28201
2026-01-19 09:19:53 +02:00
Botond Dénes
c8811387e1 Merge 'service: do not change the schema while pausing the rf change ' from Aleksandra Martyniuk
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).

Fixes: https://github.com/scylladb/scylladb/issues/28167

No backport needed; changes that introduced a bug are only on master

Closes scylladb/scylladb#28168

* github.com:scylladb/scylladb:
  service: fin indentation
  test: add test_numeric_rf_to_rack_list_conversion_abort
  service: tasks: fix type of global_topology_request_virtual_task
  service: do not change the schema while pausing the rf change
2026-01-19 09:15:20 +02:00
Botond Dénes
7d637b14e8 erge 'test/cluster/test_internode_compression: Transpose test from dtest' from Calle Wilund
Refs #27429

Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.

Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.

Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.

Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.

When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...

Closes scylladb/scylladb#28133

* github.com:scylladb/scylladb:
  test/cluster/test_internode_compression: Transpose test from dtest
  gossiper/main: Extend special treatment of node ID resolve for rpc_address
2026-01-19 08:34:31 +02:00
Ferenc Szili
1136a3f398 docs: add effective_capacity to system keyspace docs
This adds the description of effective_capacity to the documentation
of the system keyspace.
2026-01-18 16:57:08 +01:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Nadav Har'El
3e138a2685 test/cqlpy: Add our copyright/license to translated Cassandra tests
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.

However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.

So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:

   # Modifications: Copyright 2026-present ScyllaDB
   # SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

with the appropriate year instead of 2026.

Fixes #28215

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28216
2026-01-18 16:25:28 +01:00
Tomasz Grabiec
3b0df29ceb docs: Document parallel decommission and removenode and relevant task API 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
85140cdf7e test: Add tests for parallel decommission/removenode 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
5c93e12373 test: util: Introduce ensure_group0_leader_on()
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.

And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.

It's much easier to ensure the leader, than to prepare tests to
handle failovers.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
478b8f09df test: tablets: Check that there are no migrations scheduled on draining nodes
In case of decommission, it's not desirable because it's less urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
e082e32cc7 test: lib: topology_builder: Introduce add_draining_request() 2026-01-18 15:36:07 +01:00
Tomasz Grabiec
baea12c9cb topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
1b784e98f3 tablets: topology_coordinator: Refactor to propagate reason for migration rollback
Will be easier to implement later change to cancel topology request,
where we need to give a reason for doing so.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
2d954f4b19 tablet_allocator: Skip co-location on draining nodes
In case of decommission, it's not desirable because it's less
urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
d9e1a6006f node_ops: task_manager_module: Populate entity field also for active requests 2026-01-18 15:36:06 +01:00
Tomasz Grabiec
bbd293d440 tasks: node_ops: Put node id in the entity field
If we have many node requests active at a time, it's useful to know
which requets works on which node.

Fixes #27208
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
576ebcdd30 tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
They should return the same, so extract the common logic.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
629d6d98fa topology: Protect against empty cancelation reason
Request would be deemed successful, which is counter to the intention
of cancelation and effect on the system.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
7446eb7e8d tasks, topology: Make pending node operations abortable
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
091ed4d54b doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
Since 288e75fe22
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
a009644c7d raft_topology, tablets: Drain tablets in parallel with other topology operations
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
1c2e47e059 locator: topology: Add "draining" flag to a node
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a37b1ce832 topology_coordinator: Extract generate_cancel_request_update() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
77bd00bf9f storage_service: Drop dependency in topology_state_machine.hh in the header
To reduce compilation time.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a24c3fc229 locator: Extract common code in assert_rf_rack_valid_keyspace() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
d3ee82ea51 topology_coordinator, storage_service: Validate node removal/decommission at request submission time
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.

This trips tests which expect validation failure, but get tablet
draining failure instead.

Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.

So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.

validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.

Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
2026-01-18 15:36:04 +01:00
Nadav Har'El
34d28475d9 Merge 'Implement Vector Search filtering API' from Dawid Pawlik
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.

After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.

---

This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.

Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.

---

Fixes: SCYLLADB-249

New feature, should land into 2026.1

Closes scylladb/scylladb#28109

* github.com:scylladb/scylladb:
  docs: update documentation on filtering with vector queries
  test/vector_search: add test for filtered ANN with VS mock
  test/vector_search: add restriction to JSON conversion unit tests
  vector_search: cql: construct and use filter in ANN vector queries
  select_statement: do not require post query ordering for vector queries
  vector_search: add `statement_restrictions` to JSON parsing
2026-01-18 16:11:29 +02:00
Ernest Zaslavsky
eb76858369 Update seastar submodule
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```

Closes scylladb/scylladb#28147
2026-01-18 15:00:48 +02:00
Andrzej Jackowski
6eca7e4ff6 transport: unify lambda capture lifetime for control connections
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.

The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
 - update_user_scheduling_group_v1 and update_user_scheduling_group_v2
   use the deducing this (this auto self) to ensure the proper
   lifecycle of the lambda capture.
 - update_control_connection_scheduling_group doesn’t use the deducing
   this, but the lambda captures only `this`, which is used before
   the first possible coroutine preemption. Therefore, it doesn’t seem
   that any memory corruption or undefined behavior is possible here.

Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.

Fixes: SCYLLADB-284

Closes scylladb/scylladb#28158
2026-01-17 20:36:31 +02:00
Nikos Dragazis
8aca7b0eb9 test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197
2026-01-17 20:32:06 +02:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Ferenc Szili
0aebc17c4c docs: correct spelling errors in size based balancing docs
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.

Closes scylladb/scylladb#28196
2026-01-16 17:41:57 +02:00
Aleksandra Martyniuk
ad2381923f service: fin indentation 2026-01-16 11:38:10 +01:00
Aleksandra Martyniuk
504290902c test: add test_numeric_rf_to_rack_list_conversion_abort
Add regression test that checks whether aborted rf change leaves
the system_schema.keyspaces unchanged.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
3ed8701301 service: tasks: fix type of global_topology_request_virtual_task
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).

Retrieve the type from the variant before passing it to task type.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
580dfd63e5 service: do not change the schema while pausing the rf change
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
2026-01-16 11:36:15 +01:00
Patryk Jędrzejczak
eb7be9010d Merge 'topology_coordinator: Refresh load stats after table is created or altered' from Tomasz Grabiec
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes https://github.com/scylladb/scylladb/issues/27921

No backport becuse only master is vulnerable (size-based load balancing).

Closes scylladb/scylladb#27926

* https://github.com/scylladb/scylladb:
  test: cluster: Add reproducer for missed notification in topology coordinator
  topology_coordinator: Wake up the state machine after stats refresh
  topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
  topology_coordinator: Fix potential missed notification
  topology_coordinator: Refresh load stats after table is created or altered
  tablets: Do a group0 read barrier on tablet load stats refresh
  topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
  test: Use ManagerClient.{disable,enable}_tablet_balancing()
  test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
  test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
2026-01-16 11:34:57 +01:00
Dawid Pawlik
383f9e6e56 docs: update documentation on filtering with vector queries
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
67d3454d2b test/vector_search: add test for filtered ANN with VS mock
Implement a test using Vector Store mock to check if end-to-end
integration works with filtered ANN query.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a54be82536 test/vector_search: add restriction to JSON conversion unit tests
Add unit tests for conversion of CQL restrictions to Vector Store filtering API
compatible JSON objects. The tests include:
- empty restriction
- `ALLOW FILTERING` in restriction
- single column restrictions
    - `=`, `<`, `>`, `<=`, `>=`, `IN`
- multiple column restrictions
    - `()=()`, `()<()`, `()>()`, `()<=()`, `()>=()`, `() IN ()`
- multiple restrictions conjunction
- `TEXT` and `BOOLEAN` column restrictions
2026-01-16 11:18:23 +01:00
Dawid Pawlik
2a38794b8e vector_search: cql: construct and use filter in ANN vector queries
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.

Adjust existing `vector_store_client_test` tests accordingly.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
304e908e3b select_statement: do not require post query ordering for vector queries
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.

If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a84d1361db vector_search: add statement_restrictions to JSON parsing
Add a module parsing the statement restrictions into Vector Store
filtering API compatible JSON objects.

The API was defined in: scylladb/vector-store#334

Examplary JSON object compatible with the API:
```
{
 "restrictions": [
     { "type": "==", "lhs": "pk", "rhs": 1 },
     { "type": "IN", "lhs": "pk", "rhs": [2, 3] },
     { "type": "<", "lhs": "ck", "rhs": 4 },
     { "type": "<=", "lhs": "ck", "rhs": 5 },
     { "type": ">", "lhs": "pk", "rhs": 6 },
     { "type": ">=", "lhs": "pk", "rhs": 7 },
     { "type": "()==()", "lhs": ["pk", "ck"], "rhs": [10, 20] },
     { "type": "()IN()", "lhs": ["pk", "ck"], "rhs": [[100, 200], [300, 400]] },
     { "type": "()<()", "lhs": ["pk", "ck"], "rhs": [30, 40] },
     { "type": "()<=()", "lhs": ["pk", "ck"], "rhs": [50, 60] },
     { "type": "()>()", "lhs": ["pk", "ck"], "rhs": [70, 80] },
     { "type": "()>=()", "lhs": ["pk", "ck"], "rhs": [90, 0] }
 ],
 "allow_filtering": true
}
```
2026-01-16 11:18:23 +01:00
Tomasz Grabiec
3fb7719277 topology_coordinator: Update load stats in case rebuilding with no live replica
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.

This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.

Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.

Fixes #28115

Closes scylladb/scylladb#28121
2026-01-16 11:19:01 +02:00
Calle Wilund
2fd6ca4c46 test/cluster/test_internode_compression: Transpose test from dtest
Refs #27429

re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.

v2:
* Included positive test (mode=all)
2026-01-14 10:53:34 +01:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
1e37781d86 schema: Add initializer for compression defaults
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).

Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.

Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).

Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.

Fixes #26914.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
5b4aa4b6a6 schema: Initialize static properties eagerly
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).

Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".

This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.

In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:55 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Nikos Dragazis
4ec7a064a9 test: Check that CQL and Alternator tables respect compression config
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.

Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.

Mark them as xfail-ing. The marker will be removed later in this series.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:16:08 +02:00
Calle Wilund
da17e8b18b gossiper/main: Extend special treatment of node ID resolve for rpc_address
Refs #27429

If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.

Added export of this via gossiper for symmetry.
2026-01-13 14:12:19 +01:00
Tomasz Grabiec
8dc20e6aaf test: cluster: Add reproducer for missed notification in topology coordinator 2026-01-13 00:40:23 +01:00
Tomasz Grabiec
7a04dd2d22 topology_coordinator: Wake up the state machine after stats refresh
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.

The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.

Fixes #25163
2026-01-13 00:40:23 +01:00
Tomasz Grabiec
d910e6ea63 topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.

This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.

Refs #28086
2026-01-13 00:40:16 +01:00
Tomasz Grabiec
e5dee2aab8 topology_coordinator: Fix potential missed notification
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.

Fix by subscribing for the event before checking for work.

Fixes #27958
2026-01-13 00:39:01 +01:00
Tomasz Grabiec
2b7aa3211d topology_coordinator: Refresh load stats after table is created or altered
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes #27921
2026-01-13 00:38:59 +01:00
Tomasz Grabiec
663831ebd7 tablets: Do a group0 read barrier on tablet load stats refresh
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c4c5ed5aba topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
Refresh can be triggered from different places, but it should run in
the gossip scheduling group, like group0 operations.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
5e6935f276 test: Use ManagerClient.{disable,enable}_tablet_balancing() 2026-01-13 00:38:00 +01:00
Tomasz Grabiec
6936704677 test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c8098e07c9 test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
It's a global operation, so we can use any server.

It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
2026-01-13 00:38:00 +01:00
Amnon Heiman
c6d1c63ddb distributed_loader: system_replicated_keys as system keyspace
This patch adds system_replicated_keys to the list of known system
keyspaces.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:41:47 +02:00
Amnon Heiman
83c1103917 replicated_key_provider: make KSNAME public
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.

It will be used to identify it as a system keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:39:51 +02:00
429 changed files with 14048 additions and 4212 deletions

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -0,0 +1,53 @@
name: Backport with Jira Integration
on:
push:
branches:
- master
- next-*.*
- branch-*.*
pull_request_target:
types: [labeled, closed]
branches:
- master
- next
- next-*.*
- branch-*.*
jobs:
backport-on-push:
if: github.event_name == 'push'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'push'
base_branch: ${{ github.ref }}
commits: ${{ github.event.before }}..${{ github.sha }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-on-label:
if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'labeled'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
head_commit: ${{ github.event.pull_request.base.sha }}
label_name: ${{ github.event.label.name }}
pr_state: ${{ github.event.pull_request.state }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-chain:
if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'chain'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
pr_body: ${{ github.event.pull_request.body }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -9,16 +9,57 @@ on:
jobs:
trigger-jenkins:
if: (github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')) || github.event.label.name == 'conflicts'
if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
ORG="scylladb"
if gh api "/orgs/${ORG}/members/${AUTHOR}" --silent 2>/dev/null; then
echo "member=true" >> $GITHUB_OUTPUT
else
echo "::warning::${AUTHOR} is not a member of ${ORG}; skipping CI trigger."
echo "member=false" >> $GITHUB_OUTPUT
fi
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT
else
echo "trigger=false" >> $GITHUB_OUTPUT
fi
- name: Trigger Scylla-CI-Route Jenkins Job
if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"
PR_REPO_NAME: "${{ github.event.repository.full_name }}"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.1.0-dev
VERSION=2026.1.1
if test -f version
then

View File

@@ -17,6 +17,7 @@
#include "auth/service.hh"
#include "db/config.hh"
#include "db/view/view_build_status.hh"
#include "locator/tablets.hh"
#include "mutation/tombstone.hh"
#include "locator/abstract_replication_strategy.hh"
#include "utils/log.hh"
@@ -1875,23 +1876,34 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
auto ts = group0_guard.write_timestamp();
utils::chunked_vector<mutation> schema_mutations;
auto ksm = create_keyspace_metadata(keyspace_name, _proxy, _gossiper, ts, tags_map, _proxy.features(), tablets_mode);
locator::replication_strategy_params params(ksm->strategy_options(), ksm->initial_tablets(), ksm->consistency_option());
const auto& topo = _proxy.local_db().get_token_metadata().get_topology();
auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params, topo);
// Alternator Streams doesn't yet work when the table uses tablets (#23838)
if (stream_specification && stream_specification->IsObject()) {
auto stream_enabled = rjson::find(*stream_specification, "StreamEnabled");
if (stream_enabled && stream_enabled->IsBool() && stream_enabled->GetBool()) {
locator::replication_strategy_params params(ksm->strategy_options(), ksm->initial_tablets(), ksm->consistency_option());
const auto& topo = _proxy.local_db().get_token_metadata().get_topology();
auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params, topo);
if (rs->uses_tablets()) {
co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
"If you want to use streams, create a table with vnodes by setting the tag 'system:initial_tablets' set to 'none'.");
}
}
}
// Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
// GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
if (!view_builders.empty() && ksm->uses_tablets() && !_proxy.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
// Creating an index in tablets mode requires the keyspace to be RF-rack-valid.
// GSI and LSI indexes are based on materialized views which require RF-rack-validity to avoid consistency issues.
if (!view_builders.empty() || _proxy.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, _proxy.local_db().get_token_metadata_ptr(), *rs);
} catch (const std::invalid_argument& ex) {
if (!view_builders.empty()) {
co_return api_error::validation(fmt::format("GlobalSecondaryIndexes and LocalSecondaryIndexes on a table "
"using tablets require the number of racks in the cluster to be either 1 or 3"));
} else {
co_return api_error::validation(fmt::format("Cannot create table '{}' with tablets: the configuration "
"option 'rf_rack_valid_keyspaces' is enabled, which enforces that tables using tablets can only be created in clusters "
"that have either 1 or 3 racks", table_name));
}
}
}
try {
schema_mutations = service::prepare_new_keyspace_announcement(_proxy.local_db(), ksm, ts);
@@ -2114,9 +2126,12 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
!p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, p.local().local_db().get_token_metadata_ptr(),
p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy());
} catch (const std::invalid_argument& ex) {
co_return api_error::validation(fmt::format("GlobalSecondaryIndexes on a table "
"using tablets require the number of racks in the cluster to be either 1 or 3"));
}
elogger.trace("Adding GSI {}", index_name);

View File

@@ -767,7 +767,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet, erm->get_topology()); // throws if no secondary replica
if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {
if (!gossiper.is_alive(tablet_primary_replica.host)) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

View File

@@ -3051,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -515,6 +515,15 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto sstables = parsed.GetArray() |
std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |
std::ranges::to<std::vector>();
apilog.info("Restore invoked with following parameters: keyspace={}, table={}, endpoint={}, bucket={}, prefix={}, sstables_count={}, scope={}, primary_replica_only={}",
keyspace,
table,
endpoint,
bucket,
prefix,
sstables.size(),
scope,
primary_replica_only);
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);
co_return json::json_return_type(fmt::to_string(task_id));
});
@@ -2016,12 +2025,14 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
try {
if (column_families.empty()) {
co_await snap_ctl.local().take_snapshot(tag, keynames, sf);
co_await snap_ctl.local().take_snapshot(tag, keynames, opts);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
@@ -2029,7 +2040,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
}
co_return json_void();
} catch (...) {
@@ -2064,7 +2075,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
compaction::compaction_stats stats;

View File

@@ -146,7 +146,8 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

View File

@@ -53,10 +53,10 @@ static std::string json_escape(std::string_view str) {
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
future<> audit_syslog_storage_helper::syslog_send_helper(temporary_buffer<char> msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
co_await _sender.send(_syslog_address, std::span(&msg, 1));
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
@@ -90,7 +90,7 @@ future<> audit_syslog_storage_helper::start(const db::config& cfg) {
co_return;
}
co_await syslog_send_helper("Initializing syslog audit backend.");
co_await syslog_send_helper(temporary_buffer<char>::copy_of("Initializing syslog audit backend."));
}
future<> audit_syslog_storage_helper::stop() {
@@ -120,7 +120,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
audit_info->table(),
username);
co_await syslog_send_helper(msg);
co_await syslog_send_helper(std::move(msg).release());
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -139,7 +139,7 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
client_ip,
username);
co_await syslog_send_helper(msg.c_str());
co_await syslog_send_helper(std::move(msg).release());
}
}

View File

@@ -26,7 +26,7 @@ class audit_syslog_storage_helper : public storage_helper {
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
future<> syslog_send_helper(seastar::temporary_buffer<char> msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();

View File

@@ -876,22 +876,6 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
continue; // some tables might not have been created if they were not used
}
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
auto rows = co_await qp.execute_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
qs,
{},
cql3::query_processor::cache_internal::no);
if (rows->empty()) {
continue;
}
std::vector<sstring> col_names;
for (const auto& col : schema->all_columns()) {
col_names.push_back(col.name_as_cql_string());
@@ -900,30 +884,51 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
for (size_t i = 1; i < col_names.size(); ++i) {
val_binders_str += ", ?";
}
for (const auto& row : *rows) {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
std::vector<mutation> collected;
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
co_await qp.query_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
{},
1000,
[&qp, &cf_name, &col_names, &val_binders_str, &schema, ts, &collected] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}
}
}
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
co_yield std::move(muts[0]);
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
collected.push_back(std::move(muts[0]));
co_return stop_iteration::no;
},
std::move(qs));
for (auto& m : collected) {
co_yield std::move(m);
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,

View File

@@ -48,6 +48,7 @@
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/chunked_vector.hh"
#include "utils/pretty_printers.hh"
#include "readers/multi_range.hh"
#include "readers/compacting.hh"
@@ -611,23 +612,23 @@ private:
}
// Called in a seastar thread
dht::partition_range_vector
utils::chunked_vector<dht::partition_range>
get_ranges_for_invalidation(const std::vector<sstables::shared_sstable>& sstables) {
// If owned ranges is disengaged, it means no cleanup work was done and
// so nothing needs to be invalidated.
if (!_owned_ranges) {
return dht::partition_range_vector{};
return {};
}
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto owned_ranges = dht::to_partition_ranges_chunked(*_owned_ranges).get();
auto non_owned_ranges = sstables
| std::views::transform([] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}) | std::ranges::to<dht::partition_range_vector>();
}) | std::ranges::to<utils::chunked_vector<dht::partition_range>>();
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
return dht::subtract_ranges(*_schema, std::move(non_owned_ranges), std::move(owned_ranges)).get();
}
protected:
compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
@@ -718,8 +719,8 @@ protected:
compaction_completion_desc
get_compaction_completion_desc(std::vector<sstables::shared_sstable> input_sstables, std::vector<sstables::shared_sstable> output_sstables) {
auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
auto ranges = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges)};
}
// Tombstone expiration is enabled based on the presence of sstable set.

View File

@@ -16,6 +16,7 @@
#include "sstables/sstable_set.hh"
#include "compaction_fwd.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "utils/chunked_vector.hh"
namespace compaction {
@@ -38,7 +39,7 @@ struct compaction_completion_desc {
// New, fresh SSTables that should be added to SSTable set, replacing the old ones.
std::vector<sstables::shared_sstable> new_sstables;
// Set of compacted partition ranges that should be invalidated in the cache.
dht::partition_range_vector ranges_for_cache_invalidation;
utils::chunked_vector<dht::partition_range> ranges_for_cache_invalidation;
};
// creates a new SSTable for a given shard

View File

@@ -778,6 +778,7 @@ compaction_manager::get_incremental_repair_read_lock(compaction::compaction_grou
cmlog.debug("Get get_incremental_repair_read_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_read_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_read_lock for {} done", reason);
@@ -791,6 +792,7 @@ compaction_manager::get_incremental_repair_write_lock(compaction::compaction_gro
cmlog.debug("Get get_incremental_repair_write_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_write_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_write_lock for {} done", reason);
@@ -1519,7 +1521,9 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
| std::views::transform(std::mem_fn(&sstables::sstable::run_identifier))
| std::ranges::to<std::unordered_set>());
};
const auto threshold = size_t(std::max(schema->max_compaction_threshold(), 32));
const auto threshold = utils::get_local_injector().inject_parameter<size_t>("set_sstable_count_reduction_threshold")
.value_or(size_t(std::max(schema->max_compaction_threshold(), 32)));
auto count = co_await num_runs_for_compaction();
if (count <= threshold) {
cmlog.trace("No need to wait for sstable count reduction in {}: {} <= {}",
@@ -1534,9 +1538,7 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
auto& cstate = get_compaction_state(&t);
try {
while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
co_await cstate.compaction_done.wait();
}
} catch (const broken_condition_variable&) {
co_return;
@@ -2387,6 +2389,8 @@ future<> compaction_manager::remove(compaction_group_view& t, sstring reason) no
if (!c_state.gate.is_closed()) {
auto close_gate = c_state.gate.close();
co_await stop_ongoing_compactions(reason, &t);
// Wait for users of incremental repair lock (can be either repair itself or maintenance compactions).
co_await c_state.incremental_repair_lock.write_lock();
co_await std::move(close_gate);
}

View File

@@ -725,29 +725,9 @@ raft_tests = set([
vector_search_tests = set([
'test/vector_search/vector_store_client_test',
'test/vector_search/load_balancer_test',
'test/vector_search/client_test'
])
vector_search_validator_bin = 'vector-search-validator/bin/vector-search-validator'
vector_search_validator_deps = set([
'test/vector_search_validator/build-validator',
'test/vector_search_validator/Cargo.toml',
'test/vector_search_validator/crates/validator/Cargo.toml',
'test/vector_search_validator/crates/validator/src/main.rs',
'test/vector_search_validator/crates/validator-scylla/Cargo.toml',
'test/vector_search_validator/crates/validator-scylla/src/lib.rs',
'test/vector_search_validator/crates/validator-scylla/src/cql.rs',
])
vector_store_bin = 'vector-search-validator/bin/vector-store'
vector_store_deps = set([
'test/vector_search_validator/build-env',
'test/vector_search_validator/build-vector-store',
])
vector_search_validator_bins = set([
vector_search_validator_bin,
vector_store_bin,
'test/vector_search/client_test',
'test/vector_search/filter_test',
'test/vector_search/rescoring_test'
])
wasms = set([
@@ -783,7 +763,7 @@ other = set([
'iotune',
])
all_artifacts = apps | cpp_apps | tests | other | wasms | vector_search_validator_bins
all_artifacts = apps | cpp_apps | tests | other | wasms
arg_parser = argparse.ArgumentParser('Configure scylla', add_help=False, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
arg_parser.add_argument('--out', dest='buildfile', action='store', default='build.ninja',
@@ -1034,6 +1014,9 @@ scylla_core = (['message/messaging_service.cc',
'cql3/functions/aggregate_fcts.cc',
'cql3/functions/castas_fcts.cc',
'cql3/functions/error_injection_fcts.cc',
'cql3/statements/strong_consistency/modification_statement.cc',
'cql3/statements/strong_consistency/select_statement.cc',
'cql3/statements/strong_consistency/statement_helpers.cc',
'cql3/functions/vector_similarity_fcts.cc',
'cql3/statements/cf_prop_defs.cc',
'cql3/statements/cf_statement.cc',
@@ -1059,8 +1042,8 @@ scylla_core = (['message/messaging_service.cc',
'cql3/statements/raw/parsed_statement.cc',
'cql3/statements/property_definitions.cc',
'cql3/statements/update_statement.cc',
'cql3/statements/strongly_consistent_modification_statement.cc',
'cql3/statements/strongly_consistent_select_statement.cc',
'cql3/statements/broadcast_modification_statement.cc',
'cql3/statements/broadcast_select_statement.cc',
'cql3/statements/delete_statement.cc',
'cql3/statements/prune_materialized_view_statement.cc',
'cql3/statements/batch_statement.cc',
@@ -1351,6 +1334,9 @@ scylla_core = (['message/messaging_service.cc',
'lang/wasm.cc',
'lang/wasm_alien_thread_runner.cc',
'lang/wasm_instance_cache.cc',
'service/strong_consistency/groups_manager.cc',
'service/strong_consistency/coordinator.cc',
'service/strong_consistency/state_machine.cc',
'service/raft/group0_state_id_handler.cc',
'service/raft/group0_state_machine.cc',
'service/raft/group0_state_machine_merger.cc',
@@ -1380,6 +1366,7 @@ scylla_core = (['message/messaging_service.cc',
'vector_search/dns.cc',
'vector_search/client.cc',
'vector_search/clients.cc',
'vector_search/filter.cc',
'vector_search/truststore.cc'
] + [Antlr3Grammar('cql3/Cql.g')] \
+ scylla_raft_core
@@ -1489,6 +1476,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/hinted_handoff.idl.hh',
'idl/storage_proxy.idl.hh',
'idl/sstables.idl.hh',
'idl/strong_consistency/state_machine.idl.hh',
'idl/group0_state_machine.idl.hh',
'idl/mapreduce_request.idl.hh',
'idl/replica_exception.idl.hh',
@@ -1784,6 +1772,8 @@ deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
deps['test/vector_search/vector_store_client_test'] = ['test/vector_search/vector_store_client_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/load_balancer_test'] = ['test/vector_search/load_balancer_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/client_test'] = ['test/vector_search/client_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/filter_test'] = ['test/vector_search/filter_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/rescoring_test'] = ['test/vector_search/rescoring_test.cc'] + scylla_tests_dependencies
boost_tests_prefixes = ["test/boost/", "test/vector_search/", "test/raft/", "test/manual/", "test/ldap/"]
@@ -2570,11 +2560,10 @@ def write_build_file(f,
description = RUST_LIB $out
''').format(mode=mode, antlr3_exec=args.antlr3_exec, fmt_lib=fmt_lib, test_repeat=args.test_repeat, test_timeout=args.test_timeout, rustc_wrapper=rustc_wrapper, **modeval))
f.write(
'build {mode}-build: phony {artifacts} {wasms} {vector_search_validator_bins}\n'.format(
'build {mode}-build: phony {artifacts} {wasms}\n'.format(
mode=mode,
artifacts=str.join(' ', ['$builddir/' + mode + '/' + x for x in sorted(build_artifacts - wasms - vector_search_validator_bins)]),
artifacts=str.join(' ', ['$builddir/' + mode + '/' + x for x in sorted(build_artifacts - wasms)]),
wasms = str.join(' ', ['$builddir/' + x for x in sorted(build_artifacts & wasms)]),
vector_search_validator_bins=str.join(' ', ['$builddir/' + x for x in sorted(build_artifacts & vector_search_validator_bins)]),
)
)
if profile_recipe := modes[mode].get('profile_recipe'):
@@ -2604,7 +2593,7 @@ def write_build_file(f,
continue
profile_dep = modes[mode].get('profile_target', "")
if binary in other or binary in wasms or binary in vector_search_validator_bins:
if binary in other or binary in wasms:
continue
srcs = deps[binary]
# 'scylla'
@@ -2715,11 +2704,10 @@ def write_build_file(f,
)
f.write(
'build {mode}-test: test.{mode} {test_executables} $builddir/{mode}/scylla {wasms} {vector_search_validator_bins} \n'.format(
'build {mode}-test: test.{mode} {test_executables} $builddir/{mode}/scylla {wasms}\n'.format(
mode=mode,
test_executables=' '.join(['$builddir/{}/{}'.format(mode, binary) for binary in sorted(tests)]),
wasms=' '.join([f'$builddir/{binary}' for binary in sorted(wasms)]),
vector_search_validator_bins=' '.join([f'$builddir/{binary}' for binary in sorted(vector_search_validator_bins)]),
)
)
f.write(
@@ -2887,19 +2875,6 @@ def write_build_file(f,
'build compiler-training: phony {}\n'.format(' '.join(['{mode}-compiler-training'.format(mode=mode) for mode in default_modes]))
)
f.write(textwrap.dedent(f'''\
rule build-vector-search-validator
command = test/vector_search_validator/build-validator $builddir
rule build-vector-store
command = test/vector_search_validator/build-vector-store $builddir
'''))
f.write(
'build $builddir/{vector_search_validator_bin}: build-vector-search-validator {}\n'.format(' '.join([dep for dep in sorted(vector_search_validator_deps)]), vector_search_validator_bin=vector_search_validator_bin)
)
f.write(
'build $builddir/{vector_store_bin}: build-vector-store {}\n'.format(' '.join([dep for dep in sorted(vector_store_deps)]), vector_store_bin=vector_store_bin)
)
f.write(textwrap.dedent(f'''\
build dist-unified-tar: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz' for mode in default_modes])}
build dist-unified: phony dist-unified-tar

View File

@@ -47,6 +47,9 @@ target_sources(cql3
functions/aggregate_fcts.cc
functions/castas_fcts.cc
functions/error_injection_fcts.cc
statements/strong_consistency/select_statement.cc
statements/strong_consistency/modification_statement.cc
statements/strong_consistency/statement_helpers.cc
functions/vector_similarity_fcts.cc
statements/cf_prop_defs.cc
statements/cf_statement.cc
@@ -72,8 +75,8 @@ target_sources(cql3
statements/raw/parsed_statement.cc
statements/property_definitions.cc
statements/update_statement.cc
statements/strongly_consistent_modification_statement.cc
statements/strongly_consistent_select_statement.cc
statements/broadcast_modification_statement.cc
statements/broadcast_select_statement.cc
statements/delete_statement.cc
statements/prune_materialized_view_statement.cc
statements/batch_statement.cc

View File

@@ -10,9 +10,41 @@
#include "types/types.hh"
#include "types/vector.hh"
#include "exceptions/exceptions.hh"
#include <span>
#include <bit>
namespace cql3 {
namespace functions {
namespace detail {
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension) {
if (!param) {
throw exceptions::invalid_request_exception("Cannot extract float vector from null parameter");
}
const size_t expected_size = dimension * sizeof(float);
if (param->size() != expected_size) {
throw exceptions::invalid_request_exception(
fmt::format("Invalid vector size: expected {} bytes for {} floats, got {} bytes",
expected_size, dimension, param->size()));
}
std::vector<float> result;
result.reserve(dimension);
bytes_view view(*param);
for (size_t i = 0; i < dimension; ++i) {
// read_simple handles network byte order (big-endian) conversion
uint32_t raw = read_simple<uint32_t>(view);
result.push_back(std::bit_cast<float>(raw));
}
return result;
}
} // namespace detail
namespace {
// The computations of similarity scores match the exact formulas of Cassandra's (jVector's) implementation to ensure compatibility.
@@ -22,14 +54,14 @@ namespace {
// You should only use this function if you need to preserve the original vectors and cannot normalize
// them in advance.
float compute_cosine_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_cosine_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
double squared_norm_a = 0.0;
double squared_norm_b = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
dot_product += a * b;
squared_norm_a += a * a;
@@ -37,7 +69,7 @@ float compute_cosine_similarity(const std::vector<data_value>& v1, const std::ve
}
if (squared_norm_a == 0 || squared_norm_b == 0) {
throw exceptions::invalid_request_exception("Function system.similarity_cosine doesn't support all-zero vectors");
return std::numeric_limits<float>::quiet_NaN();
}
// The cosine similarity is in the range [-1, 1].
@@ -46,12 +78,12 @@ float compute_cosine_similarity(const std::vector<data_value>& v1, const std::ve
return (1 + (dot_product / (std::sqrt(squared_norm_a * squared_norm_b)))) / 2;
}
float compute_euclidean_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_euclidean_similarity(std::span<const float> v1, std::span<const float> v2) {
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
double diff = a - b;
sum += diff * diff;
@@ -65,12 +97,12 @@ float compute_euclidean_similarity(const std::vector<data_value>& v1, const std:
// Assumes that both vectors are L2-normalized.
// This similarity is intended as an optimized way to perform cosine similarity calculation.
float compute_dot_product_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_dot_product_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
dot_product += a * b;
}
@@ -136,13 +168,15 @@ bytes_opt vector_similarity_fct::execute(std::span<const bytes_opt> parameters)
return std::nullopt;
}
const auto& type = arg_types()[0];
data_value v1 = type->deserialize(*parameters[0]);
data_value v2 = type->deserialize(*parameters[1]);
const auto& v1_elements = value_cast<std::vector<data_value>>(v1);
const auto& v2_elements = value_cast<std::vector<data_value>>(v2);
// Extract dimension from the vector type
const auto& type = static_cast<const vector_type_impl&>(*arg_types()[0]);
size_t dimension = type.get_dimension();
float result = SIMILARITY_FUNCTIONS.at(_name)(v1_elements, v2_elements);
// Optimized path: extract floats directly from bytes, bypassing data_value overhead
std::vector<float> v1 = detail::extract_float_vector(parameters[0], dimension);
std::vector<float> v2 = detail::extract_float_vector(parameters[1], dimension);
float result = SIMILARITY_FUNCTIONS.at(_name)(v1, v2);
return float_type->decompose(result);
}

View File

@@ -11,6 +11,7 @@
#include "native_scalar_function.hh"
#include "cql3/assignment_testable.hh"
#include "cql3/functions/function_name.hh"
#include <span>
namespace cql3 {
namespace functions {
@@ -19,7 +20,7 @@ static const function_name SIMILARITY_COSINE_FUNCTION_NAME = function_name::nati
static const function_name SIMILARITY_EUCLIDEAN_FUNCTION_NAME = function_name::native_function("similarity_euclidean");
static const function_name SIMILARITY_DOT_PRODUCT_FUNCTION_NAME = function_name::native_function("similarity_dot_product");
using similarity_function_t = float (*)(const std::vector<data_value>&, const std::vector<data_value>&);
using similarity_function_t = float (*)(std::span<const float>, std::span<const float>);
extern thread_local const std::unordered_map<function_name, similarity_function_t> SIMILARITY_FUNCTIONS;
std::vector<data_type> retrieve_vector_arg_types(const function_name& name, const std::vector<shared_ptr<assignment_testable>>& provided_args);
@@ -33,5 +34,14 @@ public:
virtual bytes_opt execute(std::span<const bytes_opt> parameters) override;
};
namespace detail {
// Extract float vector directly from serialized bytes, bypassing data_value overhead.
// This is an internal API exposed for testing purposes.
// Vector<float, N> wire format: N floats as big-endian uint32_t values, 4 bytes each.
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension);
} // namespace detail
} // namespace functions
} // namespace cql3

View File

@@ -105,6 +105,7 @@ public:
static const std::chrono::minutes entry_expiry;
using key_type = prepared_cache_key_type;
using pinned_value_type = cache_value_ptr;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,9 +117,14 @@ public:
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
future<pinned_value_type> get_pinned(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); });
}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return get_pinned(key, std::forward<LoadFunc>(load)).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}

View File

@@ -48,8 +48,10 @@ const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono
struct query_processor::remote {
remote(service::migration_manager& mm, service::mapreduce_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client)
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote")
{}
@@ -57,6 +59,7 @@ struct query_processor::remote {
service::mapreduce_service& mapreducer;
service::storage_service& ss;
service::raft_group0_client& group0_client;
service::strong_consistency::coordinator& sc_coordinator;
seastar::named_gate gate;
};
@@ -514,9 +517,16 @@ query_processor::~query_processor() {
}
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
query_processor::acquire_strongly_consistent_coordinator() {
auto [remote_, holder] = remote();
return {remote_.get().sc_coordinator, std::move(holder)};
}
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer,
service::storage_service& ss, service::raft_group0_client& group0_client) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client);
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& sc_coordinator) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client, sc_coordinator);
}
future<> query_processor::stop_remote() {
@@ -687,7 +697,7 @@ future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
try {
auto key = compute_id(query_string, client_state.get_raw_keyspace(), d);
auto prep_ptr = co_await _prepared_cache.get(key, [this, &query_string, &client_state, d] {
auto prep_entry = co_await _prepared_cache.get_pinned(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
@@ -701,13 +711,13 @@ query_processor::prepare(sstring query_string, const service::client_state& clie
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
const auto& warnings = prep_ptr->warnings;
const auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_ptr),
co_await utils::get_local_injector().inject(
"query_processor_prepare_wait_after_cache_get",
utils::wait_for_message(std::chrono::seconds(60)));
auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_entry),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
for (const auto& w : warnings) {
msg->add_warning(w);
}
co_return ::shared_ptr<cql_transport::messages::result_message::prepared>(std::move(msg));
co_return std::move(msg);
} catch(typename prepared_statements_cache::statement_is_too_big&) {
throw prepared_statement_is_too_big(query_string);
}
@@ -860,6 +870,7 @@ struct internal_query_state {
sstring query_string;
std::unique_ptr<query_options> opts;
statements::prepared_statement::checked_weak_ptr p;
std::optional<service::query_state> qs;
bool more_results = true;
};
@@ -867,10 +878,14 @@ internal_query_state query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size) {
int32_t page_size,
std::optional<service::query_state> qs) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, cl, page_size);
return internal_query_state{query_string, std::make_unique<cql3::query_options>(std::move(opts)), std::move(p), true};
if (!qs) {
qs.emplace(query_state_for_internal_call());
}
return internal_query_state{query_string, std::make_unique<cql3::query_options>(std::move(opts)), std::move(p), std::move(qs), true};
}
bool query_processor::has_more_results(cql3::internal_query_state& state) const {
@@ -893,9 +908,8 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(internal_query_state& state) {
state.p->statement->validate(*this, service::client_state::for_internal_calls());
auto qs = query_state_for_internal_call();
::shared_ptr<cql_transport::messages::result_message> msg =
co_await state.p->statement->execute(*this, qs, *state.opts, std::nullopt);
co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
class visitor : public result_message::visitor_base {
internal_query_state& _state;
@@ -1202,8 +1216,9 @@ future<> query_processor::query_internal(
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
auto query_state = create_paged_state(query_string, cl, values, page_size);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs) {
auto query_state = create_paged_state(query_string, cl, values, page_size, std::move(qs));
co_return co_await for_each_cql_result(query_state, std::move(f));
}

View File

@@ -44,6 +44,10 @@ class query_state;
class mapreduce_service;
class raft_group0_client;
namespace strong_consistency {
class coordinator;
}
namespace broadcast_tables {
struct query;
}
@@ -155,7 +159,8 @@ public:
~query_processor();
void start_remote(service::migration_manager&, service::mapreduce_service&,
service::storage_service& ss, service::raft_group0_client&);
service::storage_service& ss, service::raft_group0_client&,
service::strong_consistency::coordinator&);
future<> stop_remote();
data_dictionary::database db() {
@@ -174,6 +179,9 @@ public:
return _proxy;
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
acquire_strongly_consistent_coordinator();
cql_stats& get_cql_stats() {
return _cql_stats;
}
@@ -322,6 +330,7 @@ public:
* page_size - maximum page size
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
* qs - optional query state (default: std::nullopt)
*
* \note This function is optimized for convenience, not performance.
*/
@@ -330,7 +339,8 @@ public:
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs = std::nullopt);
/*
* \brief iterate over all cql results using paging
@@ -499,7 +509,8 @@ private:
const sstring& query_string,
db::consistency_level,
const data_value_list& values,
int32_t page_size);
int32_t page_size,
std::optional<service::query_state> qs = std::nullopt);
/*!
* \brief run a query using paging

View File

@@ -0,0 +1,20 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <ostream>
namespace cql3 {
class result;
void print_query_results_text(std::ostream& os, const result& result);
void print_query_results_json(std::ostream& os, const result& result);
} // namespace cql3

View File

@@ -9,8 +9,10 @@
*/
#include <cstdint>
#include "types/json_utils.hh"
#include "utils/assert.hh"
#include "utils/hashers.hh"
#include "utils/rjson.hh"
#include "cql3/result_set.hh"
namespace cql3 {
@@ -46,6 +48,13 @@ void metadata::add_non_serialized_column(lw_shared_ptr<column_specification> nam
_column_info->_names.emplace_back(std::move(name));
}
void metadata::hide_last_column() {
if (_column_info->_column_count == 0) {
utils::on_internal_error("Trying to hide a column when there are no columns visible.");
}
_column_info->_column_count--;
}
void metadata::set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state) {
_flags.set<flag::HAS_MORE_PAGES>();
_paging_state = std::move(paging_state);
@@ -188,4 +197,85 @@ make_empty_metadata() {
return empty_metadata_cache;
}
void print_query_results_text(std::ostream& os, const cql3::result& result) {
const auto& metadata = result.get_metadata();
const auto& column_metadata = metadata.get_names();
struct column_values {
size_t max_size{0};
sstring header_format;
sstring row_format;
std::vector<sstring> values;
void add(sstring value) {
max_size = std::max(max_size, value.size());
values.push_back(std::move(value));
}
};
std::vector<column_values> columns;
columns.resize(column_metadata.size());
for (size_t i = 0; i < column_metadata.size(); ++i) {
columns[i].add(column_metadata[i]->name->text());
}
for (const auto& row : result.result_set().rows()) {
for (size_t i = 0; i < row.size(); ++i) {
if (row[i]) {
columns[i].add(column_metadata[i]->type->to_string(linearized(managed_bytes_view(*row[i]))));
} else {
columns[i].add("");
}
}
}
std::vector<sstring> separators(columns.size(), sstring());
for (size_t i = 0; i < columns.size(); ++i) {
auto& col_values = columns[i];
col_values.header_format = seastar::format(" {{:<{}}} ", col_values.max_size);
col_values.row_format = seastar::format(" {{:>{}}} ", col_values.max_size);
for (size_t c = 0; c < col_values.max_size; ++c) {
separators[i] += "-";
}
}
for (size_t r = 0; r < result.result_set().rows().size() + 1; ++r) {
std::vector<sstring> row;
row.reserve(columns.size());
for (size_t i = 0; i < columns.size(); ++i) {
const auto& format = r == 0 ? columns[i].header_format : columns[i].row_format;
row.push_back(fmt::format(fmt::runtime(std::string_view(format)), columns[i].values[r]));
}
fmt::print(os, "{}\n", fmt::join(row, "|"));
if (!r) {
fmt::print(os, "-{}-\n", fmt::join(separators, "-+-"));
}
}
}
void print_query_results_json(std::ostream& os, const cql3::result& result) {
const auto& metadata = result.get_metadata();
const auto& column_metadata = metadata.get_names();
rjson::streaming_writer writer(os);
writer.StartArray();
for (const auto& row : result.result_set().rows()) {
writer.StartObject();
for (size_t i = 0; i < row.size(); ++i) {
writer.Key(column_metadata[i]->name->text());
if (!row[i] || row[i]->empty()) {
writer.Null();
continue;
}
const auto value = to_json_string(*column_metadata[i]->type, *row[i]);
const auto type = to_json_type(*column_metadata[i]->type, *row[i]);
writer.RawValue(value, type);
}
writer.EndObject();
}
writer.EndArray();
}
}

View File

@@ -73,6 +73,7 @@ public:
uint32_t value_count() const;
void add_non_serialized_column(lw_shared_ptr<column_specification> name);
void hide_last_column();
public:
void set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state);

View File

@@ -225,10 +225,9 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
// The second hyphen is not really true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(qp.db().get_config(), *ks_md)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
// wrap the exception manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when RF-rack-validity is not enforced for the keyspace, we'd

View File

@@ -9,7 +9,7 @@
*/
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include <optional>
@@ -28,11 +28,11 @@
namespace cql3 {
static logging::logger logger("strongly_consistent_modification_statement");
static logging::logger logger("broadcast_modification_statement");
namespace statements {
strongly_consistent_modification_statement::strongly_consistent_modification_statement(
broadcast_modification_statement::broadcast_modification_statement(
uint32_t bound_terms,
schema_ptr schema,
broadcast_tables::prepared_update query)
@@ -43,7 +43,7 @@ strongly_consistent_modification_statement::strongly_consistent_modification_sta
{ }
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_modification_statement::execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_modification_statement::execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
return execute_without_checking_exception_message(qp, qs, options, std::move(guard))
.then(cql_transport::messages::propagate_exception_as_future<shared_ptr<cql_transport::messages::result_message>>);
}
@@ -63,7 +63,7 @@ evaluate_prepared(
}
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
}
@@ -103,11 +103,11 @@ strongly_consistent_modification_statement::execute_without_checking_exception_m
), result);
}
uint32_t strongly_consistent_modification_statement::get_bound_terms() const {
uint32_t broadcast_modification_statement::get_bound_terms() const {
return _bound_terms;
}
future<> strongly_consistent_modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
future<> broadcast_modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
auto f = state.has_column_family_access(_schema->ks_name(), _schema->cf_name(), auth::permission::MODIFY);
if (_query.value_condition.has_value()) {
f = f.then([this, &state] {
@@ -117,7 +117,7 @@ future<> strongly_consistent_modification_statement::check_access(query_processo
return f;
}
bool strongly_consistent_modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
bool broadcast_modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return _schema->ks_name() == ks_name && (!cf_name || _schema->cf_name() == *cf_name);
}

View File

@@ -27,13 +27,13 @@ struct prepared_update {
}
class strongly_consistent_modification_statement : public cql_statement_opt_metadata {
class broadcast_modification_statement : public cql_statement_opt_metadata {
const uint32_t _bound_terms;
const schema_ptr _schema;
const broadcast_tables::prepared_update _query;
public:
strongly_consistent_modification_statement(uint32_t bound_terms, schema_ptr schema, broadcast_tables::prepared_update query);
broadcast_modification_statement(uint32_t bound_terms, schema_ptr schema, broadcast_tables::prepared_update query);
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const override;

View File

@@ -9,7 +9,7 @@
*/
#include "cql3/statements/strongly_consistent_select_statement.hh"
#include "cql3/statements/broadcast_select_statement.hh"
#include <seastar/core/future.hh>
#include <seastar/core/on_internal_error.hh>
@@ -24,7 +24,7 @@ namespace cql3 {
namespace statements {
static logging::logger logger("strongly_consistent_select_statement");
static logging::logger logger("broadcast_select_statement");
static
expr::expression get_key(const cql3::expr::expression& partition_key_restrictions) {
@@ -58,7 +58,7 @@ bool is_selecting_only_value(const cql3::selection::selection& selection) {
selection.get_columns()[0]->name() == "value";
}
strongly_consistent_select_statement::strongly_consistent_select_statement(schema_ptr schema, uint32_t bound_terms,
broadcast_select_statement::broadcast_select_statement(schema_ptr schema, uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
@@ -73,7 +73,7 @@ strongly_consistent_select_statement::strongly_consistent_select_statement(schem
_query{prepare_query()}
{ }
broadcast_tables::prepared_select strongly_consistent_select_statement::prepare_query() const {
broadcast_tables::prepared_select broadcast_select_statement::prepare_query() const {
if (!is_selecting_only_value(*_selection)) {
throw service::broadcast_tables::unsupported_operation_error("only 'value' selector is allowed");
}
@@ -94,7 +94,7 @@ evaluate_prepared(
}
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
}

View File

@@ -25,12 +25,12 @@ struct prepared_select {
}
class strongly_consistent_select_statement : public select_statement {
class broadcast_select_statement : public select_statement {
const broadcast_tables::prepared_select _query;
broadcast_tables::prepared_select prepare_query() const;
public:
strongly_consistent_select_statement(schema_ptr schema,
broadcast_select_statement(schema_ptr schema,
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,

View File

@@ -123,10 +123,9 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(cfg, *ksm)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
// wrap the exception in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when RF-rack-validity is not enforced for the keyspace, we'd

View File

@@ -31,8 +31,6 @@
#include "db/config.hh"
#include "compaction/time_window_compaction_strategy.hh"
bool is_internal_keyspace(std::string_view name);
namespace cql3 {
namespace statements {
@@ -124,10 +122,6 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
addColumnMetadataFromAliases(cfmd, Collections.singletonList(valueAlias), defaultValidator, ColumnDefinition.Kind.COMPACT_VALUE);
#endif
if (!_properties->get_compression_options() && !is_internal_keyspace(keyspace())) {
builder.set_compressor_params(db.get_config().sstable_compression_user_table_options());
}
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace(), true);
}

View File

@@ -23,6 +23,7 @@
#include "index/vector_index.hh"
#include "schema/schema.hh"
#include "service/client_state.hh"
#include "service/paxos/paxos_state.hh"
#include "types/types.hh"
#include "cql3/query_processor.hh"
#include "cql3/cql_statement.hh"
@@ -329,6 +330,19 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
} else if (service::paxos::paxos_store::try_get_base_table(name)) {
// Paxos state table is internally managed by Scylla and it shouldn't be exposed to the user.
// The table is allowed to be described as a comment to ease administrative work but it's hidden from all listings.
fragmented_ostringstream os{};
fmt::format_to(os.to_iter(),
"/* Do NOT execute this statement! It's only for informational purposes.\n"
" A paxos state table is created automatically when enabling LWT on a base table.\n"
"\n{}\n"
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
}
result.push_back(std::move(table_desc));
@@ -364,7 +378,7 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
future<std::vector<description>> tables(const data_dictionary::database& db, const lw_shared_ptr<keyspace_metadata>& ks, std::optional<bool> with_internals = std::nullopt) {
auto& replica_db = db.real_database();
auto tables = ks->tables() | std::views::filter([&replica_db] (const schema_ptr& s) {
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name());
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name()) && !service::paxos::paxos_store::try_get_base_table(s->cf_name());
}) | std::ranges::to<std::vector<schema_ptr>>();
std::ranges::sort(tables, std::ranges::less(), std::mem_fn(&schema::cf_name));

View File

@@ -98,6 +98,7 @@ static locator::replication_strategy_config_options prepare_options(
const sstring& strategy_class,
const locator::token_metadata& tm,
bool rf_rack_valid_keyspaces,
bool enforce_rack_list,
locator::replication_strategy_config_options options,
const locator::replication_strategy_config_options& old_options,
bool rack_list_enabled,
@@ -107,7 +108,7 @@ static locator::replication_strategy_config_options prepare_options(
auto is_nts = locator::abstract_replication_strategy::to_qualified_class_name(strategy_class) == "org.apache.cassandra.locator.NetworkTopologyStrategy";
auto is_alter = !old_options.empty();
const auto& all_dcs = tm.get_datacenter_racks_token_owners();
auto auto_expand_racks = uses_tablets && rf_rack_valid_keyspaces && rack_list_enabled;
auto auto_expand_racks = uses_tablets && rack_list_enabled && (rf_rack_valid_keyspaces || enforce_rack_list);
logger.debug("prepare_options: {}: is_nts={} auto_expand_racks={} rack_list_enabled={} old_options={} new_options={} all_dcs={}",
strategy_class, is_nts, auto_expand_racks, rack_list_enabled, old_options, options, all_dcs);
@@ -417,7 +418,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(s
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
bool uses_tablets = initial_tablets.has_value();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
auto options = prepare_options(sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), {}, rack_list_enabled, uses_tablets);
auto options = prepare_options(sc, tm, cfg.rf_rack_valid_keyspaces(), cfg.enforce_rack_list(), get_replication_options(), {}, rack_list_enabled, uses_tablets);
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
@@ -434,7 +435,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
auto sc = get_replication_strategy_class();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
if (sc) {
options = prepare_options(*sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), old_options, rack_list_enabled, uses_tablets);
options = prepare_options(*sc, tm, cfg.rf_rack_valid_keyspaces(), cfg.enforce_rack_list(), get_replication_options(), old_options, rack_list_enabled, uses_tablets);
} else {
sc = old->strategy_name();
options = old_options;

View File

@@ -11,7 +11,7 @@
#include "utils/assert.hh"
#include "cql3/cql_statement.hh"
#include "cql3/statements/modification_statement.hh"
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include "cql3/statements/raw/modification_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/expr/expr-utils.hh"
@@ -29,6 +29,8 @@
#include "cql3/query_processor.hh"
#include "service/storage_proxy.hh"
#include "service/broadcast_tables/experimental/lang.hh"
#include "cql3/statements/strong_consistency/modification_statement.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
#include <boost/lexical_cast.hpp>
@@ -546,7 +548,7 @@ modification_statement::process_where_clause(data_dictionary::database db, expr:
}
}
::shared_ptr<strongly_consistent_modification_statement>
::shared_ptr<broadcast_modification_statement>
modification_statement::prepare_for_broadcast_tables() const {
// FIXME: implement for every type of `modification_statement`.
throw service::broadcast_tables::unsupported_operation_error{};
@@ -554,24 +556,27 @@ modification_statement::prepare_for_broadcast_tables() const {
namespace raw {
::shared_ptr<cql_statement_opt_metadata>
modification_statement::prepare_statement(data_dictionary::database db, prepare_context& ctx, cql_stats& stats) {
::shared_ptr<cql3::statements::modification_statement> statement = prepare(db, ctx, stats);
if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
return statement->prepare_for_broadcast_tables();
} else {
return statement;
}
}
std::unique_ptr<prepared_statement>
modification_statement::prepare(data_dictionary::database db, cql_stats& stats) {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
auto meta = get_prepare_context();
auto statement = prepare_statement(db, meta, stats);
auto statement = std::invoke([&] -> shared_ptr<cql_statement> {
auto result = prepare(db, meta, stats);
if (strong_consistency::is_strongly_consistent(db, schema->ks_name())) {
return ::make_shared<strong_consistency::modification_statement>(std::move(result));
}
if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
return result->prepare_for_broadcast_tables();
}
return result;
});
auto partition_key_bind_indices = meta.get_partition_key_bind_indexes(*schema);
return std::make_unique<prepared_statement>(audit_info(), std::move(statement), meta, std::move(partition_key_bind_indices));
return std::make_unique<prepared_statement>(audit_info(), std::move(statement), meta,
std::move(partition_key_bind_indices));
}
::shared_ptr<cql3::statements::modification_statement>

View File

@@ -30,7 +30,7 @@ class operation;
namespace statements {
class strongly_consistent_modification_statement;
class broadcast_modification_statement;
namespace raw { class modification_statement; }
@@ -113,15 +113,15 @@ public:
virtual void add_update_for_key(mutation& m, const query::clustering_range& range, const update_parameters& params, const json_cache_opt& json_cache) const = 0;
virtual uint32_t get_bound_terms() const override;
uint32_t get_bound_terms() const override;
virtual const sstring& keyspace() const;
const sstring& keyspace() const;
virtual const sstring& column_family() const;
const sstring& column_family() const;
virtual bool is_counter() const;
bool is_counter() const;
virtual bool is_view() const;
bool is_view() const;
int64_t get_timestamp(int64_t now, const query_options& options) const;
@@ -129,12 +129,12 @@ public:
std::optional<gc_clock::duration> get_time_to_live(const query_options& options) const;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;
// Validate before execute, using client state and current schema
void validate(query_processor&, const service::client_state& state) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
void add_operation(::shared_ptr<operation> op);
@@ -256,7 +256,9 @@ public:
virtual json_cache_opt maybe_prepare_json_cache(const query_options& options) const;
virtual ::shared_ptr<strongly_consistent_modification_statement> prepare_for_broadcast_tables() const;
virtual ::shared_ptr<broadcast_modification_statement> prepare_for_broadcast_tables() const;
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
protected:
/**
@@ -264,9 +266,7 @@ protected:
* processed to check that they are compatible.
* @throws InvalidRequestException
*/
virtual void validate_where_clause_for_conditions() const;
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
void validate_where_clause_for_conditions() const;
friend class raw::modification_statement;
};

View File

@@ -40,7 +40,6 @@ protected:
public:
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
::shared_ptr<cql_statement_opt_metadata> prepare_statement(data_dictionary::database db, prepare_context& ctx, cql_stats& stats);
::shared_ptr<cql3::statements::modification_statement> prepare(data_dictionary::database db, prepare_context& ctx, cql_stats& stats) const;
void add_raw(sstring&& raw) { _raw_cql = std::move(raw); }
const sstring& get_raw_cql() const { return _raw_cql; }

View File

@@ -131,8 +131,6 @@ private:
void verify_ordering_is_valid(const prepared_orderings_type&, const schema&, const restrictions::statement_restrictions& restrictions) const;
prepared_ann_ordering_type prepare_ann_ordering(const schema& schema, prepare_context& ctx, data_dictionary::database db) const;
// Checks whether this ordering reverses all results.
// We only allow leaving select results unchanged or reversing them.
bool is_ordering_reversed(const prepared_orderings_type&) const;

View File

@@ -8,6 +8,8 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cql3/statements/strong_consistency/select_statement.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
#include "cql3/statements/select_statement.hh"
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
@@ -16,7 +18,7 @@
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/prune_materialized_view_statement.hh"
#include "cql3/statements/strongly_consistent_select_statement.hh"
#include "cql3/statements/broadcast_select_statement.hh"
#include "exceptions/exceptions.hh"
#include <seastar/core/future.hh>
@@ -25,12 +27,14 @@
#include "service/broadcast_tables/experimental/lang.hh"
#include "service/qos/qos_common.hh"
#include "transport/messages/result_message.hh"
#include "cql3/functions/functions.hh"
#include "cql3/functions/as_json_function.hh"
#include "cql3/selection/selection.hh"
#include "cql3/util.hh"
#include "cql3/restrictions/statement_restrictions.hh"
#include "index/secondary_index.hh"
#include "types/vector.hh"
#include "vector_search/filter.hh"
#include "validation.hh"
#include "exceptions/unrecognized_entity_exception.hh"
#include <optional>
@@ -368,8 +372,9 @@ uint64_t select_statement::get_inner_loop_limit(uint64_t limit, bool is_aggregat
}
bool select_statement::needs_post_query_ordering() const {
// We need post-query ordering only for queries with IN on the partition key and an ORDER BY.
return _restrictions->key_is_in_relation() && !_parameters->orderings().empty();
// We need post-query ordering for queries with IN on the partition key and an ORDER BY
// and ANN index queries with rescoring.
return static_cast<bool>(_ordering_comparator);
}
struct select_statement_executor {
@@ -1958,14 +1963,46 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
}));
}
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, std::unique_ptr<attributes> attrs) {
struct ann_ordering_info {
secondary_index::index _index;
raw::select_statement::prepared_ann_ordering_type _prepared_ann_ordering;
bool is_rescoring_enabled;
};
static std::optional<ann_ordering_info> get_ann_ordering_info(
data_dictionary::database db,
schema_ptr schema,
lw_shared_ptr<const raw::select_statement::parameters> parameters,
prepare_context& ctx) {
if (parameters->orderings().empty()) {
return std::nullopt;
}
auto [column_id, ordering] = parameters->orderings().front();
const auto& ann_vector = std::get_if<raw::select_statement::ann_vector>(&ordering);
if (!ann_vector) {
return std::nullopt;
}
::shared_ptr<column_identifier> column = column_id->prepare_column_identifier(*schema);
const column_definition* def = schema->get_column_definition(column->name());
if (!def) {
throw exceptions::invalid_request_exception(
fmt::format("Undefined column name {}", column->text()));
}
if (!def->type->is_vector() || static_cast<const vector_type_impl*>(def->type.get())->get_elements_type()->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception("ANN ordering is only supported on float vector indexes");
}
auto e = expr::prepare_expression(*ann_vector, db, schema->ks_name(), nullptr, def->column_specification);
expr::fill_prepare_context(e, ctx);
raw::select_statement::prepared_ann_ordering_type prepared_ann_ordering = std::make_pair(std::move(def), std::move(e));
auto cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
auto [index_opt, _] = restrictions->find_idx(sim);
auto indexes = sim.list_indexes();
auto it = std::find_if(indexes.begin(), indexes.end(), [&prepared_ann_ordering](const auto& ind) {
@@ -1977,27 +2014,90 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
}
index_opt = *it;
if (!index_opt) {
throw std::runtime_error("No index found.");
return ann_ordering_info{
*it,
std::move(prepared_ann_ordering),
secondary_index::vector_index::is_rescoring_enabled(it->metadata().options())
};
}
static uint32_t add_similarity_function_to_selectors(
std::vector<selection::prepared_selector>& prepared_selectors,
const ann_ordering_info& ann_ordering_info,
data_dictionary::database db,
schema_ptr schema) {
auto similarity_function_name = secondary_index::vector_index::get_cql_similarity_function_name(ann_ordering_info._index.metadata().options());
// Create the function name
auto func_name = functions::function_name::native_function(sstring(similarity_function_name));
// Create the function arguments
std::vector<expr::expression> args;
args.push_back(expr::column_value(ann_ordering_info._prepared_ann_ordering.first));
args.push_back(ann_ordering_info._prepared_ann_ordering.second);
// Get the function object
std::vector<shared_ptr<assignment_testable>> provided_args;
provided_args.push_back(expr::as_assignment_testable(args[0], expr::type_of(args[0])));
provided_args.push_back(expr::as_assignment_testable(args[1], expr::type_of(args[1])));
auto func = cql3::functions::instance().get(db, schema->ks_name(), func_name, provided_args, schema->ks_name(), schema->cf_name(), nullptr);
// Create the function call expression
expr::function_call similarity_func_call{
.func = func,
.args = std::move(args),
};
// Add the similarity function as a prepared selector (last)
prepared_selectors.push_back(selection::prepared_selector{
.expr = std::move(similarity_func_call),
.alias = nullptr,
});
return prepared_selectors.size() - 1;
}
static select_statement::ordering_comparator_type get_similarity_ordering_comparator(std::vector<selection::prepared_selector>& prepared_selectors, uint32_t similarity_column_index) {
auto type = expr::type_of(prepared_selectors[similarity_column_index].expr);
if (type->get_kind() != abstract_type::kind::float_kind) {
seastar::on_internal_error(logger, "Similarity function must return float type.");
}
return [similarity_column_index, type] (const raw::select_statement::result_row_type& r1, const raw::select_statement::result_row_type& r2) {
auto& c1 = r1[similarity_column_index];
auto& c2 = r2[similarity_column_index];
auto f1 = c1 ? value_cast<float>(type->deserialize(*c1)) : std::numeric_limits<float>::quiet_NaN();
auto f2 = c2 ? value_cast<float>(type->deserialize(*c2)) : std::numeric_limits<float>::quiet_NaN();
if (std::isfinite(f1) && std::isfinite(f2)) {
return f1 > f2;
}
return std::isfinite(f1);
};
}
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs) {
auto prepared_filter = vector_search::prepare_filter(*restrictions, parameters->allow_filtering());
return ::make_shared<cql3::statements::vector_indexed_table_select_statement>(schema, bound_terms, parameters, std::move(selection), std::move(restrictions),
std::move(group_by_cell_indices), is_reversed, std::move(ordering_comparator), std::move(prepared_ann_ordering), std::move(limit),
std::move(per_partition_limit), stats, *index_opt, std::move(attrs));
std::move(per_partition_limit), stats, index, std::move(prepared_filter), std::move(attrs));
}
vector_indexed_table_select_statement::vector_indexed_table_select_statement(schema_ptr schema, uint32_t bound_terms, lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection, ::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed, ordering_comparator_type ordering_comparator,
prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs)
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index,
vector_search::prepared_filter prepared_filter, std::unique_ptr<attributes> attrs)
: select_statement{schema, bound_terms, parameters, selection, restrictions, group_by_cell_indices, is_reversed, ordering_comparator, limit,
per_partition_limit, stats, std::move(attrs)}
, _index{index}
, _prepared_ann_ordering(std::move(prepared_ann_ordering)) {
, _prepared_ann_ordering(std::move(prepared_ann_ordering))
, _prepared_filter(std::move(prepared_filter)) {
if (!limit.has_value()) {
throw exceptions::invalid_request_exception("Vector ANN queries must have a limit specified");
@@ -2032,13 +2132,19 @@ future<shared_ptr<cql_transport::messages::result_message>> vector_indexed_table
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto aoe = abort_on_expiry(timeout);
auto filter_json = _prepared_filter.to_json(options);
uint64_t fetch = static_cast<uint64_t>(std::ceil(limit * secondary_index::vector_index::get_oversampling(_index.metadata().options())));
auto pkeys = co_await qp.vector_store_client().ann(
_schema->ks_name(), _index.metadata().name(), _schema, get_ann_ordering_vector(options), limit, aoe.abort_source());
_schema->ks_name(), _index.metadata().name(), _schema, get_ann_ordering_vector(options), fetch, filter_json, aoe.abort_source());
if (!pkeys.has_value()) {
co_await coroutine::return_exception(
exceptions::invalid_request_exception(std::visit(vector_search::vector_store_client::ann_error_visitor{}, pkeys.error())));
}
if (pkeys->size() > limit && !secondary_index::vector_index::is_rescoring_enabled(_index.metadata().options())) {
pkeys->erase(pkeys->begin() + limit, pkeys->end());
}
co_return co_await query_base_table(qp, state, options, pkeys.value(), timeout);
});
@@ -2055,11 +2161,11 @@ void vector_indexed_table_select_statement::update_stats() const {
}
lw_shared_ptr<query::read_command> vector_indexed_table_select_statement::prepare_command_for_base_query(
query_processor& qp, service::query_state& state, const query_options& options) const {
query_processor& qp, service::query_state& state, const query_options& options, uint64_t fetch_limit) const {
auto slice = make_partition_slice(options);
return ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), std::move(slice), qp.proxy().get_max_result_size(slice),
query::tombstone_limit(qp.proxy().get_tombstone_limit()),
query::row_limit(get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate())), query::partition_limit(query::max_partitions),
query::row_limit(get_inner_loop_limit(fetch_limit, _selection->is_aggregate())), query::partition_limit(query::max_partitions),
_query_start_time_point, tracing::make_trace_info(state.get_trace_state()), query_id::create_null_id(), query::is_first_page::no,
options.get_timestamp(state));
}
@@ -2077,7 +2183,7 @@ std::vector<float> vector_indexed_table_select_statement::get_ann_ordering_vecto
future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_table_select_statement::query_base_table(query_processor& qp,
service::query_state& state, const query_options& options, const std::vector<vector_search::primary_key>& pkeys,
lowres_clock::time_point timeout) const {
auto command = prepare_command_for_base_query(qp, state, options);
auto command = prepare_command_for_base_query(qp, state, options, pkeys.size());
// For tables without clustering columns, we can optimize by querying
// partition ranges instead of individual primary keys, since the
@@ -2116,6 +2222,7 @@ future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_tab
query::result_merger{command->get_row_limit(), query::max_partitions});
co_return co_await wrap_result_to_error_message([this, &command, &options](auto result) {
command->set_row_limit(get_limit(options, _limit));
return process_results(std::move(result), command, options, _query_start_time_point);
})(std::move(result));
}
@@ -2129,6 +2236,7 @@ future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_tab
{timeout, state.get_permit(), state.get_client_state(), state.get_trace_state(), {}, {}, options.get_specific_options().node_local_only},
std::nullopt)
.then(wrap_result_to_error_message([this, &options, command](service::storage_proxy::coordinator_query_result qr) {
command->set_row_limit(get_limit(options, _limit));
return this->process_results(std::move(qr.query_result), command, options, _query_start_time_point);
}));
}
@@ -2223,32 +2331,41 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
prepared_selectors = maybe_jsonize_select_clause(std::move(prepared_selectors), db, schema);
auto aggregation_depth = 0u;
std::optional<ann_ordering_info> ann_ordering_info_opt = get_ann_ordering_info(db, schema, _parameters, ctx);
bool is_ann_query = ann_ordering_info_opt.has_value();
// Force aggregation if GROUP BY is used. This will wrap every column x as first(x).
if (!_group_by_columns.empty()) {
aggregation_depth = std::max(aggregation_depth, 1u);
if (prepared_selectors.empty()) {
// We have a "SELECT * GROUP BY". If we leave prepared_selectors
// empty, below we choose selection::wildcard() for SELECT *, and
// forget to do the "levellize" trick needed for the GROUP BY.
// So we need to set prepared_selectors. See #16531.
auto all_columns = selection::selection::wildcard_columns(schema);
std::vector<::shared_ptr<selection::raw_selector>> select_all;
select_all.reserve(all_columns.size());
for (const column_definition *cdef : all_columns) {
auto name = ::make_shared<cql3::column_identifier::raw>(cdef->name_as_text(), true);
select_all.push_back(::make_shared<selection::raw_selector>(
expr::unresolved_identifier(std::move(name)), nullptr));
}
prepared_selectors = selection::raw_selector::to_prepared_selectors(select_all, *schema, db, keyspace());
if (prepared_selectors.empty() && (!_group_by_columns.empty() || (is_ann_query && ann_ordering_info_opt->is_rescoring_enabled))) {
// We have a "SELECT * GROUP BY" or "SELECT * ORDER BY ANN" with rescoring enabled. If we leave prepared_selectors
// empty, below we choose selection::wildcard() for SELECT *, and either:
// - forget to do the "levellize" trick needed for the GROUP BY. See #16531.
// - forget to add the similarity function needed for ORDER BY ANN with rescoring. See below.
// So we need to set prepared_selectors.
auto all_columns = selection::selection::wildcard_columns(schema);
std::vector<::shared_ptr<selection::raw_selector>> select_all;
select_all.reserve(all_columns.size());
for (const column_definition *cdef : all_columns) {
auto name = ::make_shared<cql3::column_identifier::raw>(cdef->name_as_text(), true);
select_all.push_back(::make_shared<selection::raw_selector>(
expr::unresolved_identifier(std::move(name)), nullptr));
}
prepared_selectors = selection::raw_selector::to_prepared_selectors(select_all, *schema, db, keyspace());
}
for (auto& ps : prepared_selectors) {
expr::fill_prepare_context(ps.expr, ctx);
}
// Force aggregation if GROUP BY is used. This will wrap every column x as first(x).
auto aggregation_depth = _group_by_columns.empty() ? 0u : 1u;
select_statement::ordering_comparator_type ordering_comparator;
bool hide_last_column = false;
if (is_ann_query && ann_ordering_info_opt->is_rescoring_enabled) {
uint32_t similarity_column_index = add_similarity_function_to_selectors(prepared_selectors, *ann_ordering_info_opt, db, schema);
hide_last_column = true;
ordering_comparator = get_similarity_ordering_comparator(prepared_selectors, similarity_column_index);
}
for (auto& ps : prepared_selectors) {
aggregation_depth = std::max(aggregation_depth, expr::aggregation_depth(ps.expr));
}
@@ -2266,6 +2383,11 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
? selection::selection::wildcard(schema)
: selection::selection::from_selectors(db, schema, keyspace(), levellized_prepared_selectors);
if (is_ann_query && hide_last_column) {
// Hide the similarity selector from the client by reducing column_count
selection->get_result_metadata()->hide_last_column();
}
// Cassandra 5.0.2 disallows PER PARTITION LIMIT with aggregate queries
// but only if GROUP BY is not used.
// See #9879 for more details.
@@ -2273,8 +2395,6 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
throw exceptions::invalid_request_exception("PER PARTITION LIMIT is not allowed with aggregate queries.");
}
bool is_ann_query = !_parameters->orderings().empty() && std::holds_alternative<select_statement::ann_vector>(_parameters->orderings().front().second);
auto restrictions = prepare_restrictions(db, schema, ctx, selection, for_view, _parameters->allow_filtering() || is_ann_query,
restrictions::check_indexes(!_parameters->is_mutation_fragments()));
@@ -2282,19 +2402,14 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
validate_distinct_selection(*schema, *selection, *restrictions);
}
select_statement::ordering_comparator_type ordering_comparator;
bool is_reversed_ = false;
std::optional<prepared_ann_ordering_type> prepared_ann_ordering;
auto orderings = _parameters->orderings();
if (!orderings.empty()) {
if (!orderings.empty() && !is_ann_query) {
std::visit([&](auto&& ordering) {
using T = std::decay_t<decltype(ordering)>;
if constexpr (std::is_same_v<T, select_statement::ann_vector>) {
prepared_ann_ordering = prepare_ann_ordering(*schema, ctx, db);
} else {
if constexpr (!std::is_same_v<T, select_statement::ann_vector>) {
SCYLLA_ASSERT(!for_view);
verify_ordering_is_allowed(*_parameters, *restrictions);
prepared_orderings_type prepared_orderings = prepare_orderings(*schema);
@@ -2307,7 +2422,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
}
std::vector<sstring> warnings;
if (!prepared_ann_ordering.has_value()) {
if (!is_ann_query) {
check_needs_filtering(*restrictions, db.get_config().strict_allow_filtering(), warnings);
ensure_filtering_columns_retrieval(db, *selection, *restrictions);
}
@@ -2361,7 +2476,21 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
&& restrictions->partition_key_restrictions_size() == schema->partition_key_size());
};
if (_parameters->is_prune_materialized_view()) {
if (strong_consistency::is_strongly_consistent(db, schema->ks_name())) {
stmt = ::make_shared<strong_consistency::select_statement>(
schema,
ctx.bound_variables_size(),
_parameters,
std::move(selection),
std::move(restrictions),
std::move(group_by_cell_indices),
is_reversed_,
std::move(ordering_comparator),
prepare_limit(db, ctx, _limit),
prepare_limit(db, ctx, _per_partition_limit),
stats,
std::move(prepared_attrs));
} else if (_parameters->is_prune_materialized_view()) {
stmt = ::make_shared<cql3::statements::prune_materialized_view_statement>(
schema,
ctx.bound_variables_size(),
@@ -2390,10 +2519,10 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
prepare_limit(db, ctx, _per_partition_limit),
stats,
std::move(prepared_attrs));
} else if (prepared_ann_ordering) {
} else if (is_ann_query) {
stmt = vector_indexed_table_select_statement::prepare(db, schema, ctx.bound_variables_size(), _parameters, std::move(selection), std::move(restrictions),
std::move(group_by_cell_indices), is_reversed_, std::move(ordering_comparator), std::move(*prepared_ann_ordering),
prepare_limit(db, ctx, _limit), prepare_limit(db, ctx, _per_partition_limit), stats, std::move(prepared_attrs));
std::move(group_by_cell_indices), is_reversed_, std::move(ordering_comparator), std::move(ann_ordering_info_opt->_prepared_ann_ordering),
prepare_limit(db, ctx, _limit), prepare_limit(db, ctx, _per_partition_limit), stats, ann_ordering_info_opt->_index, std::move(prepared_attrs));
} else if (restrictions->uses_secondary_indexing()) {
stmt = view_indexed_table_select_statement::prepare(
db,
@@ -2425,7 +2554,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
std::move(prepared_attrs)
);
} else if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
stmt = ::make_shared<cql3::statements::strongly_consistent_select_statement>(
stmt = ::make_shared<cql3::statements::broadcast_select_statement>(
schema,
ctx.bound_variables_size(),
_parameters,
@@ -2615,28 +2744,6 @@ void select_statement::verify_ordering_is_valid(const prepared_orderings_type& o
}
}
select_statement::prepared_ann_ordering_type select_statement::prepare_ann_ordering(const schema& schema, prepare_context& ctx, data_dictionary::database db) const {
auto [column_id, ordering] = _parameters->orderings().front();
const auto& ann_vector = std::get_if<select_statement::ann_vector>(&ordering);
SCYLLA_ASSERT(ann_vector);
::shared_ptr<column_identifier> column = column_id->prepare_column_identifier(schema);
const column_definition* def = schema.get_column_definition(column->name());
if (!def) {
throw exceptions::invalid_request_exception(
fmt::format("Undefined column name {}", column->text()));
}
if (!def->type->is_vector() || static_cast<const vector_type_impl*>(def->type.get())->get_elements_type()->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception("ANN ordering is only supported on float vector indexes");
}
auto e = expr::prepare_expression(*ann_vector, db, keyspace(), nullptr, def->column_specification);
expr::fill_prepare_context(e, ctx);
return std::make_pair(std::move(def), std::move(e));
}
select_statement::ordering_comparator_type select_statement::get_ordering_comparator(const prepared_orderings_type& orderings,
selection::selection& selection,
const restrictions::statement_restrictions& restrictions) {

View File

@@ -22,6 +22,7 @@
#include "locator/host_id.hh"
#include "service/cas_shard.hh"
#include "vector_search/vector_store_client.hh"
#include "vector_search/filter.hh"
namespace service {
class client_state;
@@ -362,6 +363,7 @@ private:
class vector_indexed_table_select_statement : public select_statement {
secondary_index::index _index;
prepared_ann_ordering_type _prepared_ann_ordering;
vector_search::prepared_filter _prepared_filter;
mutable gc_clock::time_point _query_start_time_point;
public:
@@ -371,13 +373,13 @@ public:
lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, std::unique_ptr<cql3::attributes> attrs);
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);
vector_indexed_table_select_statement(schema_ptr schema, uint32_t bound_terms, lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection, ::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed, ordering_comparator_type ordering_comparator,
prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit, std::optional<expr::expression> per_partition_limit,
cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);
cql_stats& stats, const secondary_index::index& index, vector_search::prepared_filter prepared_filter, std::unique_ptr<cql3::attributes> attrs);
private:
future<::shared_ptr<cql_transport::messages::result_message>> do_execute(
@@ -385,7 +387,7 @@ private:
void update_stats() const;
lw_shared_ptr<query::read_command> prepare_command_for_base_query(query_processor& qp, service::query_state& state, const query_options& options) const;
lw_shared_ptr<query::read_command> prepare_command_for_base_query(query_processor& qp, service::query_state& state, const query_options& options, uint64_t fetch_limit) const;
std::vector<float> get_ann_ordering_vector(const query_options& options) const;

View File

@@ -0,0 +1,82 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "modification_statement.hh"
#include "transport/messages/result_message.hh"
#include "cql3/query_processor.hh"
#include "service/strong_consistency/coordinator.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
namespace cql3::statements::strong_consistency {
static logging::logger logger("sc_modification_statement");
modification_statement::modification_statement(shared_ptr<base_statement> statement)
: cql_statement_opt_metadata(&timeout_config::write_timeout)
, _statement(std::move(statement))
{
}
using result_message = cql_transport::messages::result_message;
future<shared_ptr<result_message>> modification_statement::execute(query_processor& qp, service::query_state& qs,
const query_options& options, std::optional<service::group0_guard> guard) const
{
return execute_without_checking_exception_message(qp, qs, options, std::move(guard))
.then(cql_transport::messages::propagate_exception_as_future<shared_ptr<result_message>>);
}
future<shared_ptr<result_message>> modification_statement::execute_without_checking_exception_message(
query_processor& qp, service::query_state& qs, const query_options& options,
std::optional<service::group0_guard> guard) const
{
auto json_cache = base_statement::json_cache_opt{};
const auto keys = _statement->build_partition_keys(options, json_cache);
if (keys.size() != 1 || !query::is_single_partition(keys[0])) {
throw exceptions::invalid_request_exception("Strongly consistent queries can only target a single partition");
}
if (_statement->requires_read()) {
throw exceptions::invalid_request_exception("Strongly consistent updates don't support data prefetch");
}
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
const auto mutate_result = co_await coordinator.get().mutate(_statement->s,
keys[0].start()->value().token(),
[&](api::timestamp_type ts) {
const auto prefetch_data = update_parameters::prefetch_data(_statement->s);
const auto ttl = _statement->get_time_to_live(options);
const auto params = update_parameters(_statement->s, options, ts, ttl, prefetch_data);
const auto ranges = _statement->create_clustering_ranges(options, json_cache);
auto muts = _statement->apply_updates(keys, ranges, params, json_cache);
if (muts.size() != 1) {
on_internal_error(logger, ::format("statement '{}' has unexpected number of mutations {}",
raw_cql_statement, muts.size()));
}
return std::move(*muts.begin());
});
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {
co_return co_await redirect_statement(qp, options, redirect->target);
}
co_return seastar::make_shared<result_message::void_message>();
}
future<> modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
return _statement->check_access(qp, state);
}
uint32_t modification_statement::get_bound_terms() const {
return _statement->get_bound_terms();
}
bool modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return _statement->depends_on(ks_name, cf_name);
}
}

View File

@@ -0,0 +1,39 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "cql3/expr/expression.hh"
#include "cql3/statements/modification_statement.hh"
namespace cql3::statements::strong_consistency {
class modification_statement : public cql_statement_opt_metadata {
using result_message = cql_transport::messages::result_message;
using base_statement = cql3::statements::modification_statement;
shared_ptr<base_statement> _statement;
public:
modification_statement(shared_ptr<base_statement> statement);
future<shared_ptr<result_message>> execute(query_processor& qp, service::query_state& state,
const query_options& options, std::optional<service::group0_guard> guard) const override;
future<shared_ptr<result_message>> execute_without_checking_exception_message(query_processor& qp,
service::query_state& qs, const query_options& options,
std::optional<service::group0_guard> guard) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;
uint32_t get_bound_terms() const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
};
}

View File

@@ -0,0 +1,56 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "select_statement.hh"
#include "query/query-request.hh"
#include "cql3/query_processor.hh"
#include "service/strong_consistency/coordinator.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
namespace cql3::statements::strong_consistency {
using result_message = cql_transport::messages::result_message;
future<::shared_ptr<result_message>> select_statement::do_execute(query_processor& qp,
service::query_state& state,
const query_options& options) const
{
const auto key_ranges = _restrictions->get_partition_key_ranges(options);
if (key_ranges.size() != 1 || !query::is_single_partition(key_ranges[0])) {
throw exceptions::invalid_request_exception("Strongly consistent queries can only target a single partition");
}
const auto now = gc_clock::now();
auto read_command = make_lw_shared<query::read_command>(
_query_schema->id(),
_query_schema->version(),
make_partition_slice(options),
query::max_result_size(query::result_memory_limiter::maximum_result_size),
query::tombstone_limit(query::tombstone_limit::max),
query::row_limit(get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate())),
query::partition_limit(query::max_partitions),
now,
tracing::make_trace_info(state.get_trace_state()),
query_id::create_null_id(),
query::is_first_page::no,
options.get_timestamp(state));
const auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
auto query_result = co_await coordinator.get().query(_query_schema, *read_command,
key_ranges, state.get_trace_state(), timeout);
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
co_return co_await redirect_statement(qp, options, redirect->target);
}
co_return co_await process_results(get<lw_shared_ptr<query::result>>(std::move(query_result)),
read_command, options, now);
}
}

View File

@@ -0,0 +1,26 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "cql3/statements/select_statement.hh"
namespace cql3::statements::strong_consistency {
class select_statement : public cql3::statements::select_statement {
using result_message = cql_transport::messages::result_message;
public:
using cql3::statements::select_statement::select_statement;
future<::shared_ptr<cql_transport::messages::result_message>> do_execute(query_processor& qp,
service::query_state& state, const query_options& options) const override;
};
}

View File

@@ -0,0 +1,37 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "statement_helpers.hh"
#include "transport/messages/result_message_base.hh"
#include "cql3/query_processor.hh"
#include "replica/database.hh"
#include "locator/tablet_replication_strategy.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(query_processor& qp,
const query_options& options,
const locator::tablet_replica& target)
{
const auto my_host_id = qp.db().real_database().get_token_metadata().get_topology().my_host_id();
if (target.host != my_host_id) {
throw exceptions::invalid_request_exception(format(
"Strongly consistent writes can be executed only on the leader node, "
"leader id {}, current host id {}",
target.host, my_host_id));
}
auto&& func_values_cache = const_cast<cql3::query_options&>(options).take_cached_pk_function_calls();
co_return qp.bounce_to_shard(target.shard, std::move(func_values_cache));
}
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name) {
const auto* tablet_aware_rs = db.find_keyspace(ks_name).get_replication_strategy().maybe_as_tablet_aware();
return tablet_aware_rs && tablet_aware_rs->get_consistency() != data_dictionary::consistency_config_option::eventual;
}
}

View File

@@ -0,0 +1,23 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "locator/tablets.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(
query_processor& qp,
const query_options& options,
const locator::tablet_replica& target);
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name);
}

View File

@@ -13,7 +13,7 @@
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
#include "cql3/expr/expr-utils.hh"
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include "service/broadcast_tables/experimental/lang.hh"
#include "raw/update_statement.hh"
@@ -333,7 +333,7 @@ std::optional<expr::expression> get_value_condition(const expr::expression& the_
return binop->rhs;
}
::shared_ptr<strongly_consistent_modification_statement>
::shared_ptr<broadcast_modification_statement>
update_statement::prepare_for_broadcast_tables() const {
if (attrs) {
if (attrs->is_time_to_live_set()) {
@@ -359,7 +359,7 @@ update_statement::prepare_for_broadcast_tables() const {
.value_condition = get_value_condition(_condition),
};
return ::make_shared<strongly_consistent_modification_statement>(
return ::make_shared<broadcast_modification_statement>(
get_bound_terms(),
s,
query

View File

@@ -45,7 +45,7 @@ private:
virtual void execute_operations_for_key(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const json_cache_opt& json_cache) const;
public:
virtual ::shared_ptr<strongly_consistent_modification_statement> prepare_for_broadcast_tables() const override;
virtual ::shared_ptr<broadcast_modification_statement> prepare_for_broadcast_tables() const override;
};
/*

View File

@@ -55,8 +55,21 @@ int32_t batchlog_shard_of(db_clock::time_point written_at) {
return hash & ((1ULL << batchlog_shard_bits) - 1);
}
bool is_batchlog_v1(const schema& schema) {
return schema.cf_name() == system_keyspace::BATCHLOG;
}
std::pair<partition_key, clustering_key>
get_batchlog_key(const schema& schema, int32_t version, db::batchlog_stage stage, int32_t batchlog_shard, db_clock::time_point written_at, std::optional<utils::UUID> id) {
if (is_batchlog_v1(schema)) {
if (!id) {
on_internal_error(blogger, "get_batchlog_key(): key for batchlog v1 requires batchlog id");
}
auto pkey = partition_key::from_single_value(schema, {serialized(*id)});
auto ckey = clustering_key::make_empty();
return std::pair(std::move(pkey), std::move(ckey));
}
auto pkey = partition_key::from_exploded(schema, {serialized(version), serialized(int8_t(stage)), serialized(batchlog_shard)});
std::vector<bytes> ckey_components;
@@ -85,6 +98,14 @@ mutation get_batchlog_mutation_for(schema_ptr schema, managed_bytes data, int32_
auto cdef_data = schema->get_column_definition(to_bytes("data"));
m.set_cell(ckey, *cdef_data, atomic_cell::make_live(*cdef_data->type, timestamp, std::move(data)));
if (is_batchlog_v1(*schema)) {
auto cdef_version = schema->get_column_definition(to_bytes("version"));
m.set_cell(ckey, *cdef_version, atomic_cell::make_live(*cdef_version->type, timestamp, serialized(version)));
auto cdef_written_at = schema->get_column_definition(to_bytes("written_at"));
m.set_cell(ckey, *cdef_written_at, atomic_cell::make_live(*cdef_written_at->type, timestamp, serialized(now)));
}
return m;
}
@@ -122,9 +143,10 @@ mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db_clo
const std::chrono::seconds db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, batchlog_manager_config config)
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, gms::feature_service& fs, batchlog_manager_config config)
: _qp(qp)
, _sys_ks(sys_ks)
, _fs(fs)
, _replay_timeout(config.replay_timeout)
, _replay_rate(config.replay_rate)
, _delay(config.delay)
@@ -300,23 +322,156 @@ future<> db::batchlog_manager::maybe_migrate_v1_to_v2() {
});
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
co_await maybe_migrate_v1_to_v2();
namespace {
typedef db_clock::rep clock_type;
using clock_type = db_clock::rep;
struct replay_stats {
std::optional<db_clock::time_point> min_too_fresh;
bool need_cleanup = false;
};
} // anonymous namespace
static future<db::all_batches_replayed> process_batch(
cql3::query_processor& qp,
db::batchlog_manager::stats& stats,
db::batchlog_manager::post_replay_cleanup cleanup,
utils::rate_limiter& limiter,
schema_ptr schema,
std::unordered_map<int32_t, replay_stats>& replay_stats_per_shard,
const db_clock::time_point now,
db_clock::duration replay_timeout,
std::chrono::seconds write_timeout,
const cql3::untyped_result_set::row& row) {
const bool is_v1 = db::is_batchlog_v1(*schema);
const auto stage = is_v1 ? db::batchlog_stage::initial : static_cast<db::batchlog_stage>(row.get_as<int8_t>("stage"));
const auto batch_shard = is_v1 ? 0 : row.get_as<int32_t>("shard");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto timeout = replay_timeout;
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
co_return db::all_batches_replayed::no;
}
auto data = row.get_blob_unfragmented("data");
blogger.debug("Replaying batch {} from stage {} and batch shard {}", id, int32_t(stage), batch_shard);
utils::chunked_vector<mutation> mutations;
bool send_failed = false;
auto& shard_written_at = replay_stats_per_shard.try_emplace(batch_shard, replay_stats{}).first->second;
try {
utils::chunked_vector<std::pair<canonical_mutation, schema_ptr>> fms;
auto in = ser::as_input_stream(data);
while (in.size()) {
auto fm = ser::deserialize(in, std::type_identity<canonical_mutation>());
const auto tbl = qp.db().try_find_table(fm.column_family_id());
if (!tbl) {
continue;
}
if (written_at <= tbl->get_truncation_time()) {
continue;
}
schema_ptr s = tbl->schema();
if (s->tombstone_gc_options().mode() == tombstone_gc_mode::repair) {
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
}
fms.emplace_back(std::move(fm), std::move(s));
}
if (now < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
shard_written_at.min_too_fresh = std::min(shard_written_at.min_too_fresh.value_or(written_at), written_at);
co_return db::all_batches_replayed::no;
}
auto size = data.size();
for (const auto& [fm, s] : fms) {
mutations.emplace_back(fm.to_mutation(s));
co_await coroutine::maybe_yield();
}
if (!mutations.empty()) {
const auto ttl = [written_at]() -> clock_type {
/*
* Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
* This ensures that deletes aren't "undone" by an old batch replay.
*/
auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
warn(unimplemented::cause::HINT);
#if 0
for (auto& m : *mutations) {
unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
}
#endif
return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
}();
if (ttl > 0) {
// Origin does the send manually, however I can't see a super great reason to do so.
// Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
// in both cases.
// FIXME: verify that the above is reasonably true.
co_await limiter.reserve(size);
stats.write_attempts += mutations.size();
auto timeout = db::timeout_clock::now() + write_timeout;
if (cleanup) {
co_await qp.proxy().send_batchlog_replay_to_all_replicas(mutations, timeout);
} else {
co_await qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
}
}
}
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (const data_dictionary::no_such_column_family&) {
// As above -- we should drop the batch if the table doesn't exist anymore.
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
// timeout, overload etc.
// Do _not_ remove the batch, assuning we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
if (is_v1 || !cleanup || stage == db::batchlog_stage::failed_replay) {
co_return db::all_batches_replayed::no;
}
send_failed = true;
}
auto& sp = qp.proxy();
if (send_failed) {
blogger.debug("Moving batch {} to stage failed_replay", id);
auto m = get_batchlog_mutation_for(schema, mutations, netw::messaging_service::current_version, db::batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// delete batch
auto m = get_batchlog_delete_mutation(schema, netw::messaging_service::current_version, stage, written_at, id);
co_await qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
shard_written_at.need_cleanup = true;
co_return db::all_batches_replayed(!send_failed);
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches_v1(post_replay_cleanup) {
db::all_batches_replayed all_replayed = all_batches_replayed::yes;
// rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
// max rate is scaled by the number of nodes in the cluster (same as for HHOM - see CASSANDRA-5272).
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
utils::rate_limiter limiter(throttle);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
struct replay_stats {
std::optional<db_clock::time_point> min_too_fresh;
bool need_cleanup = false;
};
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
std::unordered_map<int32_t, replay_stats> replay_stats_per_shard;
@@ -324,125 +479,49 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
// same across a while prefix of written_at (across all ids).
const auto now = db_clock::now();
auto batch = [this, cleanup, limiter, schema, &all_replayed, &replay_stats_per_shard, now] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
const auto stage = static_cast<batchlog_stage>(row.get_as<int8_t>("stage"));
const auto batch_shard = row.get_as<int32_t>("shard");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto timeout = _replay_timeout;
auto batch = [this, &limiter, schema, &all_replayed, &replay_stats_per_shard, now] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
all_replayed = all_replayed && co_await process_batch(_qp, _stats, post_replay_cleanup::no, limiter, schema, replay_stats_per_shard, now, _replay_timeout, write_timeout, row);
co_return stop_iteration::no;
};
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
all_replayed = all_batches_replayed::no;
co_return stop_iteration::no;
}
co_await with_gate(_gate, [this, &all_replayed, batch = std::move(batch)] () mutable -> future<> {
blogger.debug("Started replayAllFailedBatches");
co_await utils::get_local_injector().inject("add_delay_to_batch_replay", std::chrono::milliseconds(1000));
auto data = row.get_blob_unfragmented("data");
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
blogger.debug("Replaying batch {} from stage {} and batch shard {}", id, int32_t(stage), batch_shard);
co_await _qp.query_internal(
format("SELECT * FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
batch);
utils::chunked_vector<mutation> mutations;
bool send_failed = false;
blogger.debug("Finished replayAllFailedBatches with all_replayed: {}", all_replayed);
});
auto& shard_written_at = replay_stats_per_shard.try_emplace(batch_shard, replay_stats{}).first->second;
co_return all_replayed;
}
try {
utils::chunked_vector<std::pair<canonical_mutation, schema_ptr>> fms;
auto in = ser::as_input_stream(data);
while (in.size()) {
auto fm = ser::deserialize(in, std::type_identity<canonical_mutation>());
const auto tbl = _qp.db().try_find_table(fm.column_family_id());
if (!tbl) {
continue;
}
if (written_at <= tbl->get_truncation_time()) {
continue;
}
schema_ptr s = tbl->schema();
if (s->tombstone_gc_options().mode() == tombstone_gc_mode::repair) {
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
}
fms.emplace_back(std::move(fm), std::move(s));
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches_v2(post_replay_cleanup cleanup) {
co_await maybe_migrate_v1_to_v2();
if (now < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
db::all_batches_replayed all_replayed = all_batches_replayed::yes;
// rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
// max rate is scaled by the number of nodes in the cluster (same as for HHOM - see CASSANDRA-5272).
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
utils::rate_limiter limiter(throttle);
shard_written_at.min_too_fresh = std::min(shard_written_at.min_too_fresh.value_or(written_at), written_at);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
co_return stop_iteration::no;
}
std::unordered_map<int32_t, replay_stats> replay_stats_per_shard;
auto size = data.size();
for (const auto& [fm, s] : fms) {
mutations.emplace_back(fm.to_mutation(s));
co_await coroutine::maybe_yield();
}
if (!mutations.empty()) {
const auto ttl = [written_at]() -> clock_type {
/*
* Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
* This ensures that deletes aren't "undone" by an old batch replay.
*/
auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
warn(unimplemented::cause::HINT);
#if 0
for (auto& m : *mutations) {
unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
}
#endif
return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
}();
if (ttl > 0) {
// Origin does the send manually, however I can't see a super great reason to do so.
// Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
// in both cases.
// FIXME: verify that the above is reasonably true.
co_await limiter->reserve(size);
_stats.write_attempts += mutations.size();
auto timeout = db::timeout_clock::now() + write_timeout;
if (cleanup) {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(mutations, timeout);
} else {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
}
}
}
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (const data_dictionary::no_such_column_family&) {
// As above -- we should drop the batch if the table doesn't exist anymore.
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
all_replayed = all_batches_replayed::no;
// timeout, overload etc.
// Do _not_ remove the batch, assuning we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
if (!cleanup || stage == batchlog_stage::failed_replay) {
co_return stop_iteration::no;
}
send_failed = true;
}
auto& sp = _qp.proxy();
if (send_failed) {
blogger.debug("Moving batch {} to stage failed_replay", id);
auto m = get_batchlog_mutation_for(schema, mutations, netw::messaging_service::current_version, batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// delete batch
auto m = get_batchlog_delete_mutation(schema, netw::messaging_service::current_version, stage, written_at, id);
co_await _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
shard_written_at.need_cleanup = true;
// Use a stable `now` across all batches, so skip/replay decisions are the
// same across a while prefix of written_at (across all ids).
const auto now = db_clock::now();
auto batch = [this, cleanup, &limiter, schema, &all_replayed, &replay_stats_per_shard, now] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
all_replayed = all_replayed && co_await process_batch(_qp, _stats, cleanup, limiter, schema, replay_stats_per_shard, now, _replay_timeout, write_timeout, row);
co_return stop_iteration::no;
};
@@ -501,3 +580,10 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
co_return all_replayed;
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
if (_fs.batchlog_v2) {
return replay_all_failed_batches_v2(cleanup);
}
return replay_all_failed_batches_v1(cleanup);
}

View File

@@ -27,6 +27,12 @@ class query_processor;
} // namespace cql3
namespace gms {
class feature_service;
} // namespace gms
namespace db {
class system_keyspace;
@@ -49,6 +55,11 @@ class batchlog_manager : public peering_sharded_service<batchlog_manager> {
public:
using post_replay_cleanup = bool_class<class post_replay_cleanup_tag>;
struct stats {
uint64_t write_attempts = 0;
};
private:
static constexpr std::chrono::seconds replay_interval = std::chrono::seconds(60);
static constexpr uint32_t page_size = 128; // same as HHOM, for now, w/out using any heuristics. TODO: set based on avg batch size.
@@ -56,14 +67,13 @@ private:
using clock_type = lowres_clock;
struct stats {
uint64_t write_attempts = 0;
} _stats;
stats _stats;
seastar::metrics::metric_groups _metrics;
cql3::query_processor& _qp;
db::system_keyspace& _sys_ks;
gms::feature_service& _fs;
db_clock::duration _replay_timeout;
uint64_t _replay_rate;
std::chrono::milliseconds _delay;
@@ -84,12 +94,14 @@ private:
future<> maybe_migrate_v1_to_v2();
future<all_batches_replayed> replay_all_failed_batches_v1(post_replay_cleanup cleanup);
future<all_batches_replayed> replay_all_failed_batches_v2(post_replay_cleanup cleanup);
future<all_batches_replayed> replay_all_failed_batches(post_replay_cleanup cleanup);
public:
// Takes a QP, not a distributes. Because this object is supposed
// to be per shard and does no dispatching beyond delegating the the
// shard qp (which is what you feed here).
batchlog_manager(cql3::query_processor&, db::system_keyspace& sys_ks, batchlog_manager_config config);
batchlog_manager(cql3::query_processor&, db::system_keyspace& sys_ks, gms::feature_service& fs, batchlog_manager_config config);
// abort the replay loop and return its future.
future<> drain();
@@ -102,7 +114,7 @@ public:
return _last_replay;
}
const stats& stats() const {
const stats& get_stats() const {
return _stats;
}
private:

View File

@@ -323,6 +323,9 @@ void cache_mutation_reader::touch_partition() {
inline
future<> cache_mutation_reader::fill_buffer() {
if (const auto& ex = get_abort_exception(); ex) {
return make_exception_future<>(ex);
}
if (_state == state::before_static_row) {
touch_partition();
auto after_static_row = [this] {

View File

@@ -1986,13 +1986,13 @@ future<> db::commitlog::segment_manager::replenish_reserve() {
}
continue;
} catch (shutdown_marker&) {
_reserve_segments.abort(std::current_exception());
break;
} catch (...) {
clogger.warn("Exception in segment reservation: {}", std::current_exception());
}
co_await sleep(100ms);
}
_reserve_segments.abort(std::make_exception_ptr(shutdown_marker()));
}
future<std::vector<db::commitlog::descriptor>>

View File

@@ -1291,7 +1291,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")
, override_decommission(this, "override_decommission", value_status::Deprecated, false, "Set true to force a decommissioned node to join the cluster (cannot be set if consistent-cluster-management is enabled).")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to use enable repair based node operations instead of streaming based.")
, allowed_repair_based_node_ops(this, "allowed_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, "replace,removenode,rebuild,bootstrap,decommission", "A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
, allowed_repair_based_node_ops(this, "allowed_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, "replace,removenode,rebuild", "A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
, enable_compacting_data_for_streaming_and_repair(this, "enable_compacting_data_for_streaming_and_repair", liveness::LiveUpdate, value_status::Used, true, "Enable the compacting reader, which compacts the data for streaming and repair (load'n'stream included) before sending it to, or synchronizing it with peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead.")
, enable_tombstone_gc_for_streaming_and_repair(this, "enable_tombstone_gc_for_streaming_and_repair", liveness::LiveUpdate, value_status::Used, false,
"If the compacting reader is enabled for streaming and repair (see enable_compacting_data_for_streaming_and_repair), allow it to garbage-collect tombstones."
@@ -1341,7 +1341,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{compression_parameters::algorithm::lz4_with_dicts},
"Server-global user table compression options. If enabled, all user tables"
"will be compressed using the provided options, unless overridden"
"by compression options in the table schema. The available options are:\n"
"by compression options in the table schema. User tables are all tables in non-system keyspaces. The available options are:\n"
"* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor, LZ4WithDictsCompressor (default), SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
"* chunk_length_in_kb: (Default: 4) The size of chunks to compress in kilobytes. Allowed values are powers of two between 1 and 128.\n"
"* crc_check_chance: (Default: 1.0) Not implemented (option value is ignored).\n"
@@ -1584,7 +1584,14 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_create_table_with_compact_storage(this, "enable_create_table_with_compact_storage", liveness::LiveUpdate, value_status::Used, false, "Enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE. This feature will eventually be removed in a future version.")
, rf_rack_valid_keyspaces(this, "rf_rack_valid_keyspaces", liveness::MustRestart, value_status::Used, false,
"Enforce RF-rack-valid keyspaces. Additionally, if there are existing RF-rack-invalid "
"keyspaces, attempting to start a node with this option ON will fail.")
"keyspaces, attempting to start a node with this option ON will fail. "
"DEPRECATED. Use enforce_rack_list instead.")
, enforce_rack_list(this, "enforce_rack_list", liveness::MustRestart, value_status::Used, false,
"Enforce rack list for tablet keyspaces. "
"When the option is on, CREATE STATEMENT expands numeric rfs to rack lists "
"and ALTER STATEMENT is allowed only when rack lists are used in all DCs."
"Additionally, if there are existing tablet keyspaces with numeric rf in any DC "
"attempting to start a node with this option ON will fail.")
// FIXME: make frequency per table in order to reduce work in each iteration.
// Bigger tables will take longer to be resized. similar-sized tables can be batched into same iteration.
, tablet_load_stats_refresh_interval_in_seconds(this, "tablet_load_stats_refresh_interval_in_seconds", liveness::LiveUpdate, value_status::Used, 60,
@@ -1785,6 +1792,21 @@ const db::extensions& db::config::extensions() const {
return *_extensions;
}
compression_parameters db::config::get_sstable_compression_user_table_options(bool dicts_feature_enabled) const {
if (sstable_compression_user_table_options.is_set()
|| dicts_feature_enabled
|| !sstable_compression_user_table_options().uses_dictionary_compressor()) {
return sstable_compression_user_table_options();
} else {
// Fall back to non-dict if dictionary compression is not enabled cluster-wide.
auto options = sstable_compression_user_table_options();
auto params = options.get_options();
auto algo = compression_parameters::non_dict_equivalent(options.get_algorithm());
params[compression_parameters::SSTABLE_COMPRESSION] = sstring(compression_parameters::algorithm_to_name(algo));
return compression_parameters{params};
}
}
std::map<sstring, db::experimental_features_t::feature> db::experimental_features_t::map() {
// We decided against using the construct-on-first-use idiom here:
// https://github.com/scylladb/scylla/pull/5369#discussion_r353614807

View File

@@ -419,7 +419,13 @@ public:
named_value<bool> enable_sstables_mc_format;
named_value<bool> enable_sstables_md_format;
named_value<sstring> sstable_format;
// NOTE: Do not use this option directly.
// Use get_sstable_compression_user_table_options() instead.
named_value<compression_parameters> sstable_compression_user_table_options;
compression_parameters get_sstable_compression_user_table_options(bool dicts_feature_enabled) const;
named_value<bool> sstable_compression_dictionaries_allow_in_ddl;
named_value<bool> sstable_compression_dictionaries_enable_writing;
named_value<float> sstable_compression_dictionaries_memory_budget_fraction;
@@ -599,6 +605,7 @@ public:
named_value<bool> enable_create_table_with_compact_storage;
named_value<bool> rf_rack_valid_keyspaces;
named_value<bool> enforce_rack_list;
named_value<uint32_t> tablet_load_stats_refresh_interval_in_seconds;
named_value<bool> force_capacity_based_balancing;

View File

@@ -29,6 +29,7 @@
#include "utils/assert.hh"
#include "utils/updateable_value.hh"
#include "utils/labels.hh"
#include "utils/chunked_vector.hh"
namespace cache {
@@ -850,7 +851,7 @@ mutation_reader row_cache::make_nonpopulating_reader(schema_ptr schema, reader_p
std::move(permit),
e.key(),
query::clustering_key_filter_ranges(slice.row_ranges(*schema, e.key().key())),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), nullptr, phase_of(pos)),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), &_tracker, phase_of(pos)),
false,
_tracker.region(),
_read_section,
@@ -1215,10 +1216,10 @@ future<> row_cache::invalidate(external_updater eu, const dht::decorated_key& dk
}
future<> row_cache::invalidate(external_updater eu, const dht::partition_range& range, cache_invalidation_filter filter) {
return invalidate(std::move(eu), dht::partition_range_vector({range}), std::move(filter));
return invalidate(std::move(eu), utils::chunked_vector<dht::partition_range>({range}), std::move(filter));
}
future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&& ranges, cache_invalidation_filter filter) {
future<> row_cache::invalidate(external_updater eu, utils::chunked_vector<dht::partition_range>&& ranges, cache_invalidation_filter filter) {
return do_update(std::move(eu), [this, ranges = std::move(ranges), filter = std::move(filter)] mutable {
return seastar::async([this, ranges = std::move(ranges), filter = std::move(filter)] {
auto on_failure = defer([this] () noexcept {

View File

@@ -17,6 +17,7 @@
#include "utils/histogram.hh"
#include "mutation/partition_version.hh"
#include "utils/double-decker.hh"
#include "utils/chunked_vector.hh"
#include "db/cache_tracker.hh"
#include "readers/empty.hh"
#include "readers/mutation_source.hh"
@@ -457,7 +458,7 @@ public:
// mutation source made prior to the call to invalidate().
future<> invalidate(external_updater, const dht::decorated_key&);
future<> invalidate(external_updater, const dht::partition_range& = query::full_partition_range, cache_invalidation_filter filter = [] (const auto&) { return true; });
future<> invalidate(external_updater, dht::partition_range_vector&&, cache_invalidation_filter filter = [] (const auto&) { return true; });
future<> invalidate(external_updater, utils::chunked_vector<dht::partition_range>&&, cache_invalidation_filter filter = [] (const auto&) { return true; });
// Evicts entries from cache.
//

View File

@@ -1139,14 +1139,17 @@ future<> schema_applier::finalize_tables_and_views() {
// was already dropped (see https://github.com/scylladb/scylla/issues/5614)
for (auto& dropped_view : diff.tables_and_views.local().views.dropped) {
auto s = dropped_view.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}
for (auto& dropped_table : diff.tables_and_views.local().tables.dropped) {
auto s = dropped_table.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}
for (auto& dropped_cdc : diff.tables_and_views.local().cdc.dropped) {
auto s = dropped_cdc.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}

View File

@@ -96,16 +96,16 @@ static logging::logger diff_logger("schema_diff");
/** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
namespace db {
namespace {
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
props.enable_schema_commitlog();
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
// all schema tables are group0 tables
props.is_group0_table = true;
builder.set_is_group0_table();
}
});
}

View File

@@ -65,7 +65,7 @@ future<> snapshot_ctl::run_snapshot_modify_operation(noncopyable_function<future
});
}
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
if (tag.empty()) {
throw std::runtime_error("You must supply a snapshot name.");
}
@@ -74,21 +74,21 @@ future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_
std::ranges::copy(_db.local().get_keyspaces() | std::views::keys, std::back_inserter(keyspace_names));
};
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), sf, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), sf);
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), opts, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), opts);
});
}
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
co_await coroutine::parallel_for_each(keyspace_names, [tag, this] (const auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
});
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), sf] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, bool(sf));
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), opts] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, opts);
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
if (ks_name.empty()) {
throw std::runtime_error("You must supply a keyspace name");
}
@@ -99,14 +99,14 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
throw std::runtime_error("You must supply a snapshot name.");
}
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), sf] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), sf);
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), opts] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), opts);
});
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), bool(sf));
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {

View File

@@ -38,10 +38,14 @@ class backup_task_impl;
} // snapshot namespace
struct snapshot_options {
bool skip_flush = false;
gc_clock::time_point created_at = gc_clock::now();
std::optional<gc_clock::time_point> expires_at;
};
class snapshot_ctl : public peering_sharded_service<snapshot_ctl> {
public:
using skip_flush = bool_class<class skip_flush_tag>;
struct table_snapshot_details {
int64_t total;
int64_t live;
@@ -70,8 +74,8 @@ public:
*
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_snapshot(sstring tag, skip_flush sf = skip_flush::no) {
return take_snapshot(tag, {}, sf);
future<> take_snapshot(sstring tag, snapshot_options opts = {}) {
return take_snapshot(tag, {}, opts);
}
/**
@@ -80,7 +84,7 @@ public:
* @param tag the tag given to the snapshot; may not be null or empty
* @param keyspace_names the names of the keyspaces to snapshot; empty means "all"
*/
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {});
/**
* Takes the snapshot of multiple tables. A snapshot name must be specified.
@@ -89,7 +93,7 @@ public:
* @param tables a vector of tables names to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
/**
* Remove the snapshot with the given name from the given keyspaces.
@@ -127,8 +131,8 @@ private:
friend class snapshot::backup_task_impl;
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {} );
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
};
}

View File

@@ -42,11 +42,11 @@ extern logging::logger cdc_log;
namespace db {
namespace {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if ((ks_name == system_distributed_keyspace::NAME_EVERYWHERE && cf_name == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(ks_name == system_distributed_keyspace::NAME && cf_name == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
{
props.wait_for_sync_to_commitlog = true;
builder.set_wait_for_sync_to_commitlog(true);
}
});
}

View File

@@ -66,60 +66,44 @@ static thread_local auto sstableinfo_type = user_type_impl::get_instance(
namespace db {
namespace {
const auto set_null_sharder = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_null_sharder = schema_builder::register_schema_initializer([](schema_builder& builder) {
// tables in the "system" keyspace which need to use null sharder
static const std::unordered_set<sstring> tables = {
// empty
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.use_null_sharder = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_use_null_sharder(true);
}
});
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
system_keyspace::PAXOS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.wait_for_sync_to_commitlog = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_wait_for_sync_to_commitlog(true);
}
});
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
schema_tables::SCYLLA_TABLE_SCHEMA_HISTORY,
system_keyspace::BROADCAST_KV_STORE,
system_keyspace::CDC_GENERATIONS_V3,
system_keyspace::RAFT,
system_keyspace::RAFT_SNAPSHOTS,
system_keyspace::RAFT_SNAPSHOT_CONFIG,
system_keyspace::GROUP0_HISTORY,
system_keyspace::DISCOVERY,
system_keyspace::TABLETS,
system_keyspace::TOPOLOGY,
system_keyspace::TOPOLOGY_REQUESTS,
system_keyspace::LOCAL,
system_keyspace::PEERS,
system_keyspace::SCYLLA_LOCAL,
system_keyspace::COMMITLOG_CLEANUPS,
system_keyspace::SERVICE_LEVELS_V2,
system_keyspace::VIEW_BUILD_STATUS_V2,
system_keyspace::CDC_STREAMS_STATE,
system_keyspace::CDC_STREAMS_HISTORY,
system_keyspace::ROLES,
system_keyspace::ROLE_MEMBERS,
system_keyspace::ROLE_ATTRIBUTES,
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::v3::CDC_LOCAL,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
system_keyspace::CDC_LOCAL,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.enable_schema_commitlog();
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
// scylla_local may store a replicated tombstone related to schema
// (see `make_group0_schema_version_mutation`), so we include it in the group0 tables list.
@@ -142,8 +126,8 @@ namespace {
system_keyspace::CLIENT_ROUTES,
system_keyspace::REPAIR_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_is_group0_table();
}
});
}
@@ -918,7 +902,7 @@ schema_ptr system_keyspace::corrupt_data() {
return scylla_local;
}
schema_ptr system_keyspace::v3::batches() {
schema_ptr system_keyspace::batches() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, BATCHES), NAME, BATCHES,
// partition key
@@ -946,53 +930,7 @@ schema_ptr system_keyspace::v3::batches() {
return schema;
}
schema_ptr system_keyspace::v3::built_indexes() {
// identical to ours, but ours otoh is a mix-in of the 3.x series cassandra one
return db::system_keyspace::built_indexes();
}
schema_ptr system_keyspace::v3::local() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, LOCAL), NAME, LOCAL,
// partition key
{{"key", utf8_type}},
// clustering key
{},
// regular columns
{
{"bootstrapped", utf8_type},
{"broadcast_address", inet_addr_type},
{"cluster_name", utf8_type},
{"cql_version", utf8_type},
{"data_center", utf8_type},
{"gossip_generation", int32_type},
{"host_id", uuid_type},
{"listen_address", inet_addr_type},
{"native_protocol_version", utf8_type},
{"partitioner", utf8_type},
{"rack", utf8_type},
{"release_version", utf8_type},
{"rpc_address", inet_addr_type},
{"schema_version", uuid_type},
{"thrift_version", utf8_type},
{"tokens", set_type_impl::get_instance(utf8_type, true)},
{"truncated_at", map_type_impl::get_instance(uuid_type, bytes_type, true)},
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"information about the local node"
);
builder.set_gc_grace_seconds(0);
builder.with_hash_version();
return builder.build(schema_builder::compact_storage::no);
}();
return schema;
}
schema_ptr system_keyspace::v3::truncated() {
schema_ptr system_keyspace::truncated() {
static thread_local auto local = [] {
schema_builder builder(generate_legacy_id(NAME, TRUNCATED), NAME, TRUNCATED,
// partition key
@@ -1022,7 +960,7 @@ schema_ptr system_keyspace::v3::truncated() {
thread_local data_type replay_position_type = tuple_type_impl::get_instance({long_type, int32_type});
schema_ptr system_keyspace::v3::commitlog_cleanups() {
schema_ptr system_keyspace::commitlog_cleanups() {
static thread_local auto local = [] {
schema_builder builder(generate_legacy_id(NAME, COMMITLOG_CLEANUPS), NAME, COMMITLOG_CLEANUPS,
// partition key
@@ -1049,47 +987,7 @@ schema_ptr system_keyspace::v3::commitlog_cleanups() {
return local;
}
schema_ptr system_keyspace::v3::peers() {
// identical
return db::system_keyspace::peers();
}
schema_ptr system_keyspace::v3::peer_events() {
// identical
return db::system_keyspace::peer_events();
}
schema_ptr system_keyspace::v3::range_xfers() {
// identical
return db::system_keyspace::range_xfers();
}
schema_ptr system_keyspace::v3::compaction_history() {
// identical
return db::system_keyspace::compaction_history();
}
schema_ptr system_keyspace::v3::sstable_activity() {
// identical
return db::system_keyspace::sstable_activity();
}
schema_ptr system_keyspace::v3::size_estimates() {
// identical
return db::system_keyspace::size_estimates();
}
schema_ptr system_keyspace::v3::large_partitions() {
// identical
return db::system_keyspace::large_partitions();
}
schema_ptr system_keyspace::v3::scylla_local() {
// identical
return db::system_keyspace::scylla_local();
}
schema_ptr system_keyspace::v3::available_ranges() {
schema_ptr system_keyspace::available_ranges() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, AVAILABLE_RANGES), NAME, AVAILABLE_RANGES,
// partition key
@@ -1112,7 +1010,7 @@ schema_ptr system_keyspace::v3::available_ranges() {
return schema;
}
schema_ptr system_keyspace::v3::views_builds_in_progress() {
schema_ptr system_keyspace::views_builds_in_progress() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, VIEWS_BUILDS_IN_PROGRESS), NAME, VIEWS_BUILDS_IN_PROGRESS,
// partition key
@@ -1135,7 +1033,7 @@ schema_ptr system_keyspace::v3::views_builds_in_progress() {
return schema;
}
schema_ptr system_keyspace::v3::built_views() {
schema_ptr system_keyspace::built_views() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, BUILT_VIEWS), NAME, BUILT_VIEWS,
// partition key
@@ -1158,7 +1056,7 @@ schema_ptr system_keyspace::v3::built_views() {
return schema;
}
schema_ptr system_keyspace::v3::scylla_views_builds_in_progress() {
schema_ptr system_keyspace::scylla_views_builds_in_progress() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return schema_builder(NAME, SCYLLA_VIEWS_BUILDS_IN_PROGRESS, std::make_optional(id))
@@ -1174,7 +1072,7 @@ schema_ptr system_keyspace::v3::scylla_views_builds_in_progress() {
return schema;
}
/*static*/ schema_ptr system_keyspace::v3::cdc_local() {
/*static*/ schema_ptr system_keyspace::cdc_local() {
static thread_local auto cdc_local = [] {
schema_builder builder(generate_legacy_id(NAME, CDC_LOCAL), NAME, CDC_LOCAL,
// partition key
@@ -1800,7 +1698,9 @@ std::unordered_set<dht::token> decode_tokens(const set_type_impl::native_type& t
std::unordered_set<dht::token> tset;
for (auto& t: tokens) {
auto str = value_cast<sstring>(t);
SCYLLA_ASSERT(str == dht::token::from_sstring(str).to_sstring());
if (str != dht::token::from_sstring(str).to_sstring()) {
on_internal_error(slogger, format("decode_tokens: invalid token string '{}'", str));
}
tset.insert(dht::token::from_sstring(str));
}
return tset;
@@ -2180,21 +2080,21 @@ future<> system_keyspace::update_cdc_generation_id(cdc::generation_id gen_id) {
co_await std::visit(make_visitor(
[this] (cdc::generation_id_v1 id) -> future<> {
co_await execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL), id.ts);
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", CDC_LOCAL),
sstring(CDC_LOCAL), id.ts);
},
[this] (cdc::generation_id_v2 id) -> future<> {
co_await execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp, uuid) VALUES (?, ?, ?)", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL), id.ts, id.id);
format("INSERT INTO system.{} (key, streams_timestamp, uuid) VALUES (?, ?, ?)", CDC_LOCAL),
sstring(CDC_LOCAL), id.ts, id.id);
}
), gen_id);
}
future<std::optional<cdc::generation_id>> system_keyspace::get_cdc_generation_id() {
auto msg = co_await execute_cql(
format("SELECT streams_timestamp, uuid FROM system.{} WHERE key = ?", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL));
format("SELECT streams_timestamp, uuid FROM system.{} WHERE key = ?", CDC_LOCAL),
sstring(CDC_LOCAL));
if (msg->empty()) {
co_return std::nullopt;
@@ -2220,19 +2120,19 @@ static const sstring CDC_REWRITTEN_KEY = "rewritten";
future<> system_keyspace::cdc_set_rewritten(std::optional<cdc::generation_id_v1> gen_id) {
if (gen_id) {
return execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", v3::CDC_LOCAL),
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", CDC_LOCAL),
CDC_REWRITTEN_KEY, gen_id->ts).discard_result();
} else {
// Insert just the row marker.
return execute_cql(
format("INSERT INTO system.{} (key) VALUES (?)", v3::CDC_LOCAL),
format("INSERT INTO system.{} (key) VALUES (?)", CDC_LOCAL),
CDC_REWRITTEN_KEY).discard_result();
}
}
future<bool> system_keyspace::cdc_is_rewritten() {
// We don't care about the actual timestamp; it's additional information for debugging purposes.
return execute_cql(format("SELECT key FROM system.{} WHERE key = ?", v3::CDC_LOCAL), CDC_REWRITTEN_KEY)
return execute_cql(format("SELECT key FROM system.{} WHERE key = ?", CDC_LOCAL), CDC_REWRITTEN_KEY)
.then([] (::shared_ptr<cql3::untyped_result_set> msg) {
return !msg->empty();
});
@@ -2376,11 +2276,11 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
scylla_local(), db::schema_tables::scylla_table_schema_history(),
repair_history(),
repair_tasks(),
v3::views_builds_in_progress(), v3::built_views(),
v3::scylla_views_builds_in_progress(),
v3::truncated(),
v3::commitlog_cleanups(),
v3::cdc_local(),
views_builds_in_progress(), built_views(),
scylla_views_builds_in_progress(),
truncated(),
commitlog_cleanups(),
cdc_local(),
raft(), raft_snapshots(), raft_snapshot_config(), group0_history(), discovery(),
topology(), cdc_generations_v3(), topology_requests(), service_levels_v2(), view_build_status_v2(),
dicts(), view_building_tasks(), client_routes(), cdc_streams_state(), cdc_streams_history()
@@ -2403,7 +2303,7 @@ static bool maybe_write_in_user_memory(schema_ptr s) {
return (s.get() == system_keyspace::batchlog().get())
|| (s.get() == system_keyspace::batchlog_v2().get())
|| (s.get() == system_keyspace::paxos().get())
|| s == system_keyspace::v3::scylla_views_builds_in_progress();
|| s == system_keyspace::scylla_views_builds_in_progress();
}
future<> system_keyspace::make(
@@ -2689,7 +2589,7 @@ mutation system_keyspace::make_size_estimates_mutation(const sstring& ks, std::v
future<> system_keyspace::register_view_for_building(sstring ks_name, sstring view_name, const dht::token& token) {
sstring req = format("INSERT INTO system.{} (keyspace_name, view_name, generation_number, cpu_id, first_token) VALUES (?, ?, ?, ?, ?)",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return execute_cql(
std::move(req),
std::move(ks_name),
@@ -2705,7 +2605,7 @@ future<> system_keyspace::register_view_for_building_for_all_shards(sstring ks_n
// before all shards are registered.
// if another shard has already registered, this won't overwrite its status. if it hasn't registered, we insert
// a status with first_token=null and next_token=null, indicating it hasn't made progress.
auto&& schema = db::system_keyspace::v3::scylla_views_builds_in_progress();
auto&& schema = db::system_keyspace::scylla_views_builds_in_progress();
auto timestamp = api::new_timestamp();
mutation m{schema, partition_key::from_single_value(*schema, utf8_type->decompose(ks_name))};
@@ -2723,7 +2623,7 @@ future<> system_keyspace::register_view_for_building_for_all_shards(sstring ks_n
future<> system_keyspace::update_view_build_progress(sstring ks_name, sstring view_name, const dht::token& token) {
sstring req = format("INSERT INTO system.{} (keyspace_name, view_name, next_token, cpu_id) VALUES (?, ?, ?, ?)",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return execute_cql(
std::move(req),
std::move(ks_name),
@@ -2734,14 +2634,14 @@ future<> system_keyspace::update_view_build_progress(sstring ks_name, sstring vi
future<> system_keyspace::remove_view_build_progress_across_all_shards(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<> system_keyspace::remove_view_build_progress(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ? AND cpu_id = ?", v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ? AND cpu_id = ?", SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
std::move(ks_name),
std::move(view_name),
int32_t(this_shard_id())).discard_result();
@@ -2749,20 +2649,20 @@ future<> system_keyspace::remove_view_build_progress(sstring ks_name, sstring vi
future<> system_keyspace::mark_view_as_built(sstring ks_name, sstring view_name) {
return execute_cql(
format("INSERT INTO system.{} (keyspace_name, view_name) VALUES (?, ?)", v3::BUILT_VIEWS),
format("INSERT INTO system.{} (keyspace_name, view_name) VALUES (?, ?)", BUILT_VIEWS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<> system_keyspace::remove_built_view(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", v3::BUILT_VIEWS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", BUILT_VIEWS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<std::vector<system_keyspace::view_name>> system_keyspace::load_built_views() {
return execute_cql(format("SELECT * FROM system.{}", v3::BUILT_VIEWS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
return execute_cql(format("SELECT * FROM system.{}", BUILT_VIEWS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
return *cql_result
| std::views::transform([] (const cql3::untyped_result_set::row& row) {
auto ks_name = row.get_as<sstring>("keyspace_name");
@@ -2774,7 +2674,7 @@ future<std::vector<system_keyspace::view_name>> system_keyspace::load_built_view
future<std::vector<system_keyspace::view_build_progress>> system_keyspace::load_view_build_progress() {
return execute_cql(format("SELECT keyspace_name, view_name, first_token, next_token, cpu_id FROM system.{}",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
SCYLLA_VIEWS_BUILDS_IN_PROGRESS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
std::vector<view_build_progress> progress;
for (auto& row : *cql_result) {
auto ks_name = row.get_as<sstring>("keyspace_name");
@@ -3227,6 +3127,8 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
co_return ret;
}
const bool strongly_consistent_tables = _db.features().strongly_consistent_tables;
for (auto& row : *rs) {
if (!row.has("host_id")) {
// There are no clustering rows, only the static row.
@@ -3275,7 +3177,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
};
}
} else if (must_have_tokens(nstate)) {
on_fatal_internal_error(slogger, format(
on_internal_error(slogger, format(
"load_topology_state: node {} in {} state but missing ring slice", host_id, nstate));
}
}
@@ -3357,7 +3259,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
// Currently, at most one node at a time can be in transitioning state.
if (!map->empty()) {
const auto& [other_id, other_rs] = *map->begin();
on_fatal_internal_error(slogger, format(
on_internal_error(slogger, format(
"load_topology_state: found two nodes in transitioning state: {} in {} state and {} in {} state",
other_id, other_rs.state, host_id, nstate));
}
@@ -3415,8 +3317,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
format("SELECT count(range_end) as cnt FROM {}.{} WHERE key = '{}' AND id = ?",
NAME, CDC_GENERATIONS_V3, cdc::CDC_GENERATIONS_V3_KEY),
gen_id.id);
SCYLLA_ASSERT(gen_rows);
if (gen_rows->empty()) {
if (!gen_rows || gen_rows->empty()) {
on_internal_error(slogger, format(
"load_topology_state: last committed CDC generation time UUID ({}) present, but data missing", gen_id.id));
}
@@ -3463,7 +3364,9 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.session = service::session_id(some_row.get_as<utils::UUID>("session"));
}
if (some_row.has("tablet_balancing_enabled")) {
if (strongly_consistent_tables) {
ret.tablet_balancing_enabled = false;
} else if (some_row.has("tablet_balancing_enabled")) {
ret.tablet_balancing_enabled = some_row.get_as<bool>("tablet_balancing_enabled");
} else {
ret.tablet_balancing_enabled = true;

View File

@@ -127,6 +127,8 @@ class system_keyspace : public seastar::peering_sharded_service<system_keyspace>
static schema_ptr raft_snapshot_config();
static schema_ptr local();
static schema_ptr truncated();
static schema_ptr commitlog_cleanups();
static schema_ptr peers();
static schema_ptr peer_events();
static schema_ptr range_xfers();
@@ -137,7 +139,10 @@ class system_keyspace : public seastar::peering_sharded_service<system_keyspace>
static schema_ptr large_rows();
static schema_ptr large_cells();
static schema_ptr corrupt_data();
static schema_ptr scylla_local();
static schema_ptr batches();
static schema_ptr available_ranges();
static schema_ptr built_views();
static schema_ptr cdc_local();
future<> force_blocking_flush(sstring cfname);
// This function is called when the system.peers table is read,
// and it fixes some types of inconsistencies that can occur
@@ -204,6 +209,14 @@ public:
static constexpr auto VIEW_BUILDING_TASKS = "view_building_tasks";
static constexpr auto CLIENT_ROUTES = "client_routes";
static constexpr auto VERSIONS = "versions";
static constexpr auto BATCHES = "batches";
static constexpr auto AVAILABLE_RANGES = "available_ranges";
static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
static constexpr auto CDC_TIMESTAMPS = "cdc_timestamps";
static constexpr auto CDC_STREAMS = "cdc_streams";
// auth
static constexpr auto ROLES = "roles";
@@ -211,42 +224,6 @@ public:
static constexpr auto ROLE_ATTRIBUTES = "role_attributes";
static constexpr auto ROLE_PERMISSIONS = "role_permissions";
struct v3 {
static constexpr auto BATCHES = "batches";
static constexpr auto PAXOS = "paxos";
static constexpr auto BUILT_INDEXES = "IndexInfo";
static constexpr auto LOCAL = "local";
static constexpr auto PEERS = "peers";
static constexpr auto PEER_EVENTS = "peer_events";
static constexpr auto RANGE_XFERS = "range_xfers";
static constexpr auto COMPACTION_HISTORY = "compaction_history";
static constexpr auto SSTABLE_ACTIVITY = "sstable_activity";
static constexpr auto SIZE_ESTIMATES = "size_estimates";
static constexpr auto AVAILABLE_RANGES = "available_ranges";
static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
static schema_ptr batches();
static schema_ptr built_indexes();
static schema_ptr local();
static schema_ptr truncated();
static schema_ptr commitlog_cleanups();
static schema_ptr peers();
static schema_ptr peer_events();
static schema_ptr range_xfers();
static schema_ptr compaction_history();
static schema_ptr sstable_activity();
static schema_ptr size_estimates();
static schema_ptr large_partitions();
static schema_ptr scylla_local();
static schema_ptr available_ranges();
static schema_ptr views_builds_in_progress();
static schema_ptr built_views();
static schema_ptr scylla_views_builds_in_progress();
static schema_ptr cdc_local();
};
// Partition estimates for a given range of tokens.
struct range_estimates {
schema_ptr schema;
@@ -264,6 +241,7 @@ public:
static schema_ptr batchlog_v2();
static schema_ptr paxos();
static schema_ptr built_indexes(); // TODO (from Cassandra): make private
static schema_ptr scylla_local();
static schema_ptr raft();
static schema_ptr raft_snapshots();
static schema_ptr repair_history();
@@ -283,6 +261,8 @@ public:
static schema_ptr dicts();
static schema_ptr view_building_tasks();
static schema_ptr client_routes();
static schema_ptr views_builds_in_progress();
static schema_ptr scylla_views_builds_in_progress();
// auth
static schema_ptr roles();

View File

@@ -195,7 +195,7 @@ public:
return mutation_reader(std::make_unique<build_progress_reader>(
s,
std::move(permit),
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
_db.find_column_family(s->ks_name(), system_keyspace::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
range,
slice,
std::move(trace_state),

View File

@@ -930,8 +930,7 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
const row& existing_row = existing.cells();
const row& updated_row = update.cells();
const bool base_has_nonexpiring_marker = update.marker().is_live() && !update.marker().is_expiring();
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row, base_has_nonexpiring_marker] (const column_definition& cdef) {
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row] (const column_definition& cdef) {
const auto view_it = _view->columns_by_name().find(cdef.name());
const bool column_is_selected = view_it != _view->columns_by_name().end();
@@ -939,49 +938,29 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
// as part of its PK, there are NO virtual columns corresponding to the unselected columns in the view.
// Because of that, we don't generate view updates when the value in an unselected column is created
// or changes.
if (!column_is_selected && _base_info.has_base_non_pk_columns_in_view_pk) {
if (!column_is_selected) {
return true;
}
//TODO(sarna): Optimize collections case - currently they do not go under optimization
if (!cdef.is_atomic()) {
return false;
}
// We cannot skip if the value was created or deleted, unless we have a non-expiring marker
// We cannot skip if the value was created or deleted
const auto* existing_cell = existing_row.find_cell(cdef.id);
const auto* updated_cell = updated_row.find_cell(cdef.id);
if (existing_cell == nullptr || updated_cell == nullptr) {
return existing_cell == updated_cell || (!column_is_selected && base_has_nonexpiring_marker);
return existing_cell == updated_cell;
}
if (!cdef.is_atomic()) {
return existing_cell->as_collection_mutation().data == updated_cell->as_collection_mutation().data;
}
atomic_cell_view existing_cell_view = existing_cell->as_atomic_cell(cdef);
atomic_cell_view updated_cell_view = updated_cell->as_atomic_cell(cdef);
// We cannot skip when a selected column is changed
if (column_is_selected) {
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
// With non-expiring row marker, liveness checks below are not relevant
if (base_has_nonexpiring_marker) {
return true;
}
if (existing_cell_view.is_live() != updated_cell_view.is_live()) {
return false;
}
// We cannot skip if the change updates TTL
const bool existing_has_ttl = existing_cell_view.is_live_and_has_ttl();
const bool updated_has_ttl = updated_cell_view.is_live_and_has_ttl();
if (existing_has_ttl || updated_has_ttl) {
return existing_has_ttl == updated_has_ttl && existing_cell_view.expiry() == updated_cell_view.expiry();
}
return true;
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
});
}
@@ -1749,7 +1728,7 @@ static endpoints_to_update get_view_natural_endpoint_vnodes(
std::vector<std::reference_wrapper<const locator::node>> base_nodes,
std::vector<std::reference_wrapper<const locator::node>> view_nodes,
locator::endpoint_dc_rack my_location,
const locator::network_topology_strategy* network_topology,
const bool network_topology,
replica::cf_stats& cf_stats) {
using node_vector = std::vector<std::reference_wrapper<const locator::node>>;
node_vector base_endpoints, view_endpoints;
@@ -1902,7 +1881,7 @@ endpoints_to_update get_view_natural_endpoint(
locator::host_id me,
const locator::effective_replication_map_ptr& base_erm,
const locator::effective_replication_map_ptr& view_erm,
const locator::abstract_replication_strategy& replication_strategy,
const bool network_topology,
const dht::token& base_token,
const dht::token& view_token,
bool use_tablets,
@@ -1910,7 +1889,6 @@ endpoints_to_update get_view_natural_endpoint(
auto& topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
auto& my_location = topology.get_location(me);
auto* network_topology = dynamic_cast<const locator::network_topology_strategy*>(&replication_strategy);
auto resolve = [&] (const locator::topology& topology, const locator::host_id& ep, bool is_view) -> const locator::node& {
if (auto* np = topology.find_node(ep)) {
@@ -1944,7 +1922,7 @@ endpoints_to_update get_view_natural_endpoint(
// view pairing as the leaving base replica.
// note that the recursive call will not recurse again because leaving_base is in base_nodes.
auto leaving_base = it->get().host_id();
return get_view_natural_endpoint(leaving_base, base_erm, view_erm, replication_strategy, base_token,
return get_view_natural_endpoint(leaving_base, base_erm, view_erm, network_topology, base_token,
view_token, use_tablets, cf_stats);
}
}
@@ -2040,7 +2018,9 @@ future<> view_update_generator::mutate_MV(
wait_for_all_updates wait_for_all)
{
auto& ks = _db.find_keyspace(base->ks_name());
auto& replication = ks.get_replication_strategy();
const bool uses_tablets = ks.uses_tablets();
const bool uses_nts = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy()) != nullptr;
// The object pointed by `ks` may disappear after preeemption. It should not be touched again after this comment.
std::unordered_map<table_id, locator::effective_replication_map_ptr> erms;
auto get_erm = [&] (table_id id) {
auto it = erms.find(id);
@@ -2059,8 +2039,8 @@ future<> view_update_generator::mutate_MV(
co_await max_concurrent_for_each(view_updates, max_concurrent_updates, [&] (frozen_mutation_and_schema mut) mutable -> future<> {
auto view_token = dht::get_token(*mut.s, mut.fm.key());
auto view_ermp = erms.at(mut.s->id());
auto [target_endpoint, no_pairing_endpoint] = get_view_natural_endpoint(me, base_ermp, view_ermp, replication, base_token, view_token,
ks.uses_tablets(), cf_stats);
auto [target_endpoint, no_pairing_endpoint] = get_view_natural_endpoint(me, base_ermp, view_ermp, uses_nts, base_token, view_token,
uses_tablets, cf_stats);
auto remote_endpoints = view_ermp->get_pending_replicas(view_token);
auto memory_units = seastar::make_lw_shared<db::timeout_semaphore_units>(pending_view_update_memory_units.split(memory_usage_of(mut)));
if (no_pairing_endpoint) {
@@ -3707,7 +3687,7 @@ void validate_view_keyspace(const data_dictionary::database& db, std::string_vie
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, tmptr, rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
throw std::logic_error(fmt::format(
"Materialized views and secondary indexes are not supported on the keyspace '{}': {}",
keyspace_name, e.what()));

View File

@@ -303,7 +303,7 @@ endpoints_to_update get_view_natural_endpoint(
locator::host_id node,
const locator::effective_replication_map_ptr& base_erm,
const locator::effective_replication_map_ptr& view_erm,
const locator::abstract_replication_strategy& replication_strategy,
const bool network_topology,
const dht::token& base_token,
const dht::token& view_token,
bool use_tablets,

View File

@@ -68,6 +68,8 @@ public:
.with_column("peer", inet_addr_type, column_kind::partition_key)
.with_column("dc", utf8_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
.with_column("status", utf8_type)
.with_column("load", utf8_type)
.with_column("tokens", int32_type)
@@ -107,8 +109,11 @@ public:
if (tm.get_topology().has_node(hostid)) {
// Not all entries in gossiper are present in the topology
sstring dc = tm.get_topology().get_location(hostid).dc;
auto& node = tm.get_topology().get_node(hostid);
sstring dc = node.dc_rack().dc;
set_cell(cr, "dc", dc);
set_cell(cr, "draining", node.is_draining());
set_cell(cr, "excluded", node.is_excluded());
}
if (ownership.contains(eps.get_ip())) {
@@ -1134,6 +1139,8 @@ public:
set_cell(r.cells(), "dc", node.dc());
set_cell(r.cells(), "rack", node.rack());
set_cell(r.cells(), "up", _gossiper.local().is_alive(host));
set_cell(r.cells(), "draining", node.is_draining());
set_cell(r.cells(), "excluded", node.is_excluded());
if (auto ip = _gossiper.local().get_address_map().find(host)) {
set_cell(r.cells(), "ip", data_value(inet_address(*ip)));
}
@@ -1144,6 +1151,9 @@ public:
if (stats && stats->capacity.contains(host)) {
auto capacity = stats->capacity.at(host);
set_cell(r.cells(), "storage_capacity", data_value(int64_t(capacity)));
if (auto ts_iter = stats->tablet_stats.find(host); ts_iter != stats->tablet_stats.end()) {
set_cell(r.cells(), "effective_capacity", data_value(int64_t(ts_iter->second.effective_capacity)));
}
if (auto utilization = load.get_allocated_utilization(host)) {
set_cell(r.cells(), "storage_allocated_utilization", data_value(double(*utilization)));
@@ -1168,9 +1178,12 @@ private:
.with_column("rack", utf8_type)
.with_column("ip", inet_addr_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
.with_column("tablets_allocated", long_type)
.with_column("tablets_allocated_per_shard", double_type)
.with_column("storage_capacity", long_type)
.with_column("effective_capacity", long_type)
.with_column("storage_allocated_load", long_type)
.with_column("storage_allocated_utilization", double_type)
.with_column("storage_load", long_type)
@@ -1332,8 +1345,8 @@ public:
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_timestamps");
return schema_builder(system_keyspace::NAME, "cdc_timestamps", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", reversed_type_impl::get_instance(timestamp_type), column_kind::clustering_key)
@@ -1415,8 +1428,8 @@ public:
}
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_streams");
return schema_builder(system_keyspace::NAME, "cdc_streams", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_STREAMS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_STREAMS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", timestamp_type, column_kind::clustering_key)
@@ -1484,7 +1497,7 @@ future<> initialize_virtual_tables(
co_await add_table(std::make_unique<cdc_streams_table>(db, ss));
db.find_column_family(system_keyspace::size_estimates()).set_virtual_reader(mutation_source(db::size_estimates::virtual_reader(db, sys_ks.local())));
db.find_column_family(system_keyspace::v3::views_builds_in_progress()).set_virtual_reader(mutation_source(db::view::build_progress_virtual_reader(db)));
db.find_column_family(system_keyspace::views_builds_in_progress()).set_virtual_reader(mutation_source(db::view::build_progress_virtual_reader(db)));
db.find_column_family(system_keyspace::built_indexes()).set_virtual_reader(mutation_source(db::index::built_indexes_virtual_reader(db)));
}

View File

@@ -352,6 +352,16 @@ dht::partition_range_vector to_partition_ranges(const dht::token_range_vector& r
return prs;
}
future<utils::chunked_vector<dht::partition_range>> to_partition_ranges_chunked(const dht::token_range_vector& ranges) {
utils::chunked_vector<dht::partition_range> prs;
prs.reserve(ranges.size());
for (auto& range : ranges) {
prs.push_back(dht::to_partition_range(range));
co_await coroutine::maybe_yield();
}
co_return prs;
}
std::map<unsigned, dht::partition_range_vector>
split_range_to_shards(dht::partition_range pr, const schema& s, const sharder& raw_sharder) {
std::map<unsigned, dht::partition_range_vector> ret;
@@ -364,11 +374,11 @@ split_range_to_shards(dht::partition_range pr, const schema& s, const sharder& r
return ret;
}
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& source_ranges, dht::partition_range_vector ranges_to_subtract) {
future<utils::chunked_vector<dht::partition_range>> subtract_ranges(const schema& schema, utils::chunked_vector<dht::partition_range> source_ranges, utils::chunked_vector<dht::partition_range> ranges_to_subtract) {
auto cmp = dht::ring_position_comparator(schema);
// optimize set of potentially overlapping ranges by deoverlapping them.
auto ranges = dht::partition_range::deoverlap(source_ranges, cmp);
dht::partition_range_vector res;
auto ranges = dht::partition_range::deoverlap(std::move(source_ranges), cmp);
utils::chunked_vector<dht::partition_range> res;
res.reserve(ranges.size() * 2);
auto range = ranges.begin();

View File

@@ -91,6 +91,7 @@ inline token get_token(const schema& s, partition_key_view key) {
dht::partition_range to_partition_range(dht::token_range);
dht::partition_range_vector to_partition_ranges(const dht::token_range_vector& ranges, utils::can_yield can_yield = utils::can_yield::no);
future<utils::chunked_vector<dht::partition_range>> to_partition_ranges_chunked(const dht::token_range_vector& ranges);
// Each shard gets a sorted, disjoint vector of ranges
std::map<unsigned, dht::partition_range_vector>
@@ -105,7 +106,7 @@ std::unique_ptr<dht::i_partitioner> make_partitioner(sstring name);
// Returns a sorted and deoverlapped list of ranges that are
// the result of subtracting all ranges from ranges_to_subtract.
// ranges_to_subtract must be sorted and deoverlapped.
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& ranges, dht::partition_range_vector ranges_to_subtract);
future<utils::chunked_vector<dht::partition_range>> subtract_ranges(const schema& schema, utils::chunked_vector<dht::partition_range> ranges, utils::chunked_vector<dht::partition_range> ranges_to_subtract);
// Returns a token_range vector split based on the given number of most-significant bits
dht::token_range_vector split_token_range_msb(unsigned most_significant_bits);

View File

@@ -30,6 +30,31 @@ enum class token_kind {
after_all_keys,
};
// Represents a token for partition keys.
// Has a disengaged state, which sorts before all engaged states.
struct raw_token {
int64_t value;
/// Constructs a disengaged token.
raw_token() : value(std::numeric_limits<int64_t>::min()) {}
/// Constructs an engaged token.
/// The token must be of token_kind::key kind.
explicit raw_token(const token&);
explicit raw_token(int64_t v) : value(v) {};
std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
std::strong_ordering operator<=>(const token& o) const noexcept;
/// Returns true iff engaged.
explicit operator bool() const noexcept {
return value != std::numeric_limits<int64_t>::min();
}
};
using raw_token_opt = seastar::optimized_optional<raw_token>;
class token {
// INT64_MIN is not a legal token, but a special value used to represent
// infinity in token intervals.
@@ -52,6 +77,10 @@ public:
constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}
token(raw_token raw) noexcept
: token(raw ? kind::key : kind::before_all_keys, raw.value)
{ }
// This constructor seems redundant with the bytes_view constructor, but
// it's necessary for IDL, which passes a deserialized_bytes_proxy here.
// (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
}
};
inline
raw_token::raw_token(const token& t)
: value(t.raw())
{
#ifdef DEBUG
assert(t._kind == token::kind::key);
#endif
}
inline
std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
switch (o._kind) {
case token::kind::after_all_keys:
return std::strong_ordering::less;
case token::kind::before_all_keys:
// before_all_keys has a raw value set to the same raw value as a disengaged raw_token, and sorts before all keys.
// So we can order them by just comparing raw values.
[[fallthrough]];
case token::kind::key:
return value <=> o._data;
}
}
inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
if (l1 == l2) {
return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
}
};
template <>
struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const dht::raw_token& t, FormatContext& ctx) const {
if (!t) {
return fmt::format_to(ctx.out(), "null");
}
return fmt::format_to(ctx.out(), "{}", t.value);
}
};
namespace std {
template<>

View File

@@ -114,6 +114,10 @@ WantedBy=local-fs.target scylla-server.service
'''[1:-1]
with open('/etc/systemd/system/var-lib-systemd-coredump.mount', 'w') as f:
f.write(dot_mount)
# in case we have old mounts in deleted state hanging around from older installation
# systemd doesn't seem to be able to deal with those properly, and assume they are still active
# and doesn't do anything about them
run('umount /var/lib/systemd/coredump', shell=True, check=False)
os.makedirs('/var/lib/scylla/coredump', exist_ok=True)
systemd_unit.reload()
systemd_unit('var-lib-systemd-coredump.mount').enable()

View File

@@ -97,7 +97,9 @@ bcp LICENSE-ScyllaDB-Source-Available.md /licenses/
run microdnf clean all
run microdnf --setopt=tsflags=nodocs -y update
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip cpio
# Extract only systemctl binary from systemd package to avoid installing the whole systemd in the container.
run bash -rc "microdnf download systemd && rpm2cpio systemd-*.rpm | cpio -idmv ./usr/bin/systemctl && rm -rf systemd-*.rpm"
run curl -L --output /etc/yum.repos.d/scylla.repo ${repo_file_url}
run pip3 install --no-cache-dir --prefix /usr supervisor
run bash -ec "echo LANG=C.UTF-8 > /etc/locale.conf"
@@ -106,6 +108,8 @@ run bash -ec "cat /scylla_bashrc >> /etc/bash.bashrc"
run mkdir -p /var/log/scylla
run chown -R scylla:scylla /var/lib/scylla
run sed -i -e 's/^SCYLLA_ARGS=".*"$/SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --network-stack posix"/' /etc/sysconfig/scylla-server
# Cleanup packages not needed in the final image and clean package manager cache to reduce image size.
run bash -rc "microdnf remove -y cpio && microdnf clean all"
run mkdir -p /opt/scylladb/supervisor
run touch /opt/scylladb/SCYLLA-CONTAINER-FILE

View File

@@ -1,6 +1,14 @@
### a dictionary of redirections
#old path: new path
# Remove an outdated KB
/stable/kb/perftune-modes-sync.html: /stable/kb/index.html
# Remove the troubleshooting page relevant for Open Source only
/stable/troubleshooting/missing-dotmount-files.html: /troubleshooting/index.html
# Move the diver information to another project
/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html

View File

@@ -142,10 +142,6 @@ want modify a non-top-level attribute directly (e.g., a.b[3].c) need RMW:
Alternator implements such requests by reading the entire top-level
attribute a, modifying only a.b[3].c, and then writing back a.
Currently, Alternator doesn't use Tablets. That's because Alternator relies
on LWT (lightweight transactions), and LWT is not supported in keyspaces
with Tablets enabled.
```{eval-rst}
.. toctree::
:maxdepth: 2

View File

@@ -187,6 +187,23 @@ You can create a keyspace with tablets enabled with the ``tablets = {'enabled':
the keyspace schema with ``tablets = { 'enabled': false }`` or
``tablets = { 'enabled': true }``.
.. _keyspace-rf-rack-valid-to-enforce-rack-list:
Enforcing Rack-List Replication for Tablet Keyspaces
------------------------------------------------------------------
The ``rf_rack_valid_keyspaces`` is a legacy option that ensures that all keyspaces with tablets enabled are
:term:`RF-rack-valid <RF-rack-valid keyspace>`.
Requiring every tablet keyspace to use the rack list replication factor exclusively is enough to guarantee the keyspace is
:term:`RF-rack-valid <RF-rack-valid keyspace>`. It reduces restrictions and provides stronger guarantees compared
to ``rf_rack_valid_keyspaces`` option.
To enforce rack list in tablet keyspaces, use ``enforce_rack_list`` option. It can be set only if all tablet keyspaces use
rack list. To ensure that, follow a procedure of :ref:`conversion to rack list replication factor <conversion-to-rack-list-rf>`.
After that restart all nodes in the cluster, with ``enforce_rack_list`` enabled and ``rf_rack_valid_keyspaces`` disabled. Make
sure to avoid setting or updating replication factor (with CREATE KEYSPACE or ALTER KEYSPACE) while nodes are being restarted.
.. _tablets-limitations:
Limitations and Unsupported Features

View File

@@ -190,7 +190,7 @@ then every rack in every datacenter receives a replica, except for racks compris
of only :doc:`zero-token nodes </architecture/zero-token-nodes>`. Racks added after
the keyspace creation do not receive replicas.
When ``rf_rack_valid_keyspaces``` is enabled in the config and the keyspace is tablet-based,
When ``enforce_rack_list`` (or (deprecated) ``rf_rack_valid_keyspaces``) is enabled in the config and the keyspace is tablet-based,
the numeric replication factor is automatically expanded into a rack list when the statement is
executed, which can be observed in the DESCRIBE output afterwards. If the numeric RF is smaller than
the number of racks in a DC, a subset of racks is chosen arbitrarily.
@@ -200,8 +200,6 @@ for two cases. One is setting replication factor to 0, in which case the number
The other is when the numeric replication factor is equal to the current number of replicas
for a given datacanter, in which case the current rack list is preserved.
Altering from a numeric replication factor to a rack list is not supported yet.
Note that when ``ALTER`` ing keyspaces and supplying ``replication_factor``,
auto-expansion will only *add* new datacenters for safety, it will not alter
existing datacenters or remove any even if they are no longer in the cluster.
@@ -424,6 +422,21 @@ Altering from a rack list to a numeric replication factor is not supported.
Keyspaces which use rack lists are :term:`RF-rack-valid <RF-rack-valid keyspace>` if each rack in the rack list contains at least one node (excluding :doc:`zero-token nodes </architecture/zero-token-nodes>`).
.. _conversion-to-rack-list-rf:
Conversion to rack-list replication factor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To migrate a keyspace from a numeric replication factor to a rack-list replication factor, provide the rack-list replication factor explicitly in ALTER KEYSPACE statement. The number of racks in the list must be equal to the numeric replication factor. The replication factor can be converted in any number of DCs at once. In a statement that converts replication factor, no replication factor updates (increase or decrease) are allowed in any DC.
.. code-block:: cql
CREATE KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1} AND tablets = { 'enabled': true };
ALTER KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc1' : ['RAC1', 'RAC2', 'RAC3'], 'dc2' : ['RAC4']} AND tablets = { 'enabled': true };
.. _drop-keyspace-statement:
DROP KEYSPACE

View File

@@ -241,8 +241,8 @@ Currently, the possible orderings are limited by the :ref:`clustering order <clu
.. _vector-queries:
Vector queries
~~~~~~~~~~~~~~
Vector queries :label-note:`ScyllaDB Cloud`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``ORDER BY`` clause can also be used with vector columns to perform the approximate nearest neighbor (ANN) search.
When using vector columns, the syntax is as follows:
@@ -280,11 +280,22 @@ For example::
FROM ImageEmbeddings
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
.. warning::
Currently, vector queries do not support filtering with ``WHERE`` clause,
grouping with ``GROUP BY`` and paging. This will be added in the future releases.
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key.
See :ref:`WHERE <where-clause>`.
For example::
SELECT image_id FROM ImageEmbeddings
WHERE user_id = 'user123'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
.. note::
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Vector indexes do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
.. _limit-clause:

View File

@@ -129,30 +129,94 @@ More on :doc:`Local Secondary Indexes </features/local-secondary-indexes>`
.. _create-vector-index-statement:
Vector Index :label-caution:`Experimental` :label-note:`ScyllaDB Cloud`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Vector Index :label-note:`ScyllaDB Cloud`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
Vector indexes are supported in ScyllaDB Cloud only in the clusters that have the vector search feature enabled.
Moreover, vector indexes are an experimental feature that:
* is not suitable for production use,
* does not guarantee backward compatibility between ScyllaDB versions,
* does not support all the features of ScyllaDB (e.g., tracing, filtering, TTL).
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Vector indexes do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
ScyllaDB supports creating vector indexes on tables, allowing queries on the table to use those indexes for efficient
similarity search on vector data.
similarity search on vector data. Vector indexes can be a global index for indexing vectors per table or a local
index for indexing vectors per partition.
The vector index is the only custom type index supported in ScyllaDB. It is created using
the ``CUSTOM`` keyword and specifying the index type as ``vector_index``. Example:
the ``CUSTOM`` keyword and specifying the index type as ``vector_index``. It is also possible to
add additional columns to the index for filtering the search results. The partition column
specified in the global vector index definition must be the vector column, and any subsequent
columns are treated as filtering columns. The local vector index requires that the partition key
of the base table is also the partition key of the index and the vector column is the first one
from the following columns.
Example of a simple index:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding)
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed to enable similarity search using
a global vector index. Additional filtering can be performed on the primary key
columns of the base table.
Example of a global vector index with additional filtering:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding, category, info)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed to enable similarity search using
a global index. Additional columns are added for filtering the search results.
The filtering is possible on ``category``, ``info`` and all primary key columns
of the base table.
Example of a local vector index:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings ((id, created_at), embedding, category, info)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed for similarity search (a local
index) and additional columns are added for filtering the search results. The
filtering is possible on ``category``, ``info`` and all primary key columns of
the base table. The columns ``id`` and ``created_at`` must be the partition key
of the base table.
Vector indexes support additional filtering columns of native data types
(excluding counter and duration). The indexed column itself must be a vector
column, while the extra columns can be used to filter search results.
The supported types are:
* ``ascii``
* ``bigint``
* ``blob``
* ``boolean``
* ``date``
* ``decimal``
* ``double``
* ``float``
* ``inet``
* ``int``
* ``smallint``
* ``text``
* ``varchar``
* ``time``
* ``timestamp``
* ``timeuuid``
* ``tinyint``
* ``uuid``
* ``varint``
The following options are supported for vector indexes. All of them are optional.
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
@@ -177,6 +241,26 @@ The following options are supported for vector indexes. All of them are optional
| | as ``efSearch``. Higher values lead to better recall (i.e., more relevant results are found) | |
| | but increase query latency. Supported values are integers between 1 and 4096. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``quantization`` | The quantization method to use for compressing vectors in Vector Index. Vectors in base table | ``f32`` |
| | are never compressed. Supported values (case-insensitive) are: | |
| | | |
| | * ``f32``: 32-bit single-precision IEEE 754 floating-point. | |
| | * ``f16``: 16-bit standard half-precision floating-point (IEEE 754). | |
| | * ``bf16``: 16-bit "Brain" floating-point (optimized for ML workloads). | |
| | * ``i8``: 8-bit signed integer. | |
| | * ``b1``: 1-bit binary value (packed 8 per byte). | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``oversampling`` | A multiplier for the candidate set size during the search phase. For example, if a query asks for 10 | ``1.0`` |
| | similar vectors (``LIMIT 10``) and ``oversampling`` is 2.0, the search will initially retrieve 20 | |
| | candidates. This can improve accuracy at the cost of latency. Supported values are | |
| | floating-point numbers between 1.0 (no oversampling) and 100.0. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``rescoring`` | Flag enabling recalculation of similarity scores with full precision and re-ranking of the candidate set.| ``false`` |
| | Valid only for quantization below ``f32``. Supported values are: | |
| | | |
| | * ``true``: Enable rescoring. | |
| | * ``false``: Disable rescoring. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
.. _drop-index-statement:

View File

@@ -665,6 +665,14 @@ it is not possible to update only some elements of a vector (without updating th
Types stored in a vector are not implicitly frozen, so if you want to store a frozen collection or
frozen UDT in a vector, you need to explicitly wrap them using `frozen` keyword.
.. note::
The main application of vectors is to support vector search capabilities, which
are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Note that Vector Search clusters do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
.. .. _custom-types:
.. Custom Types

View File

@@ -221,6 +221,87 @@ scylla-bucket/prefix/
```
See the API [documentation](#copying-sstables-on-s3-backup) for more details about the actual backup request.
### The snapshot manifest
Each table snapshot directory contains a manifest.json file that lists the contents of the snapshot and some metadata.
The json structure is as follows:
```
{
"manifest": {
"version": "1.0",
"scope": "node"
},
"node": {
"host_id": "<UUID>",
"datacenter": "mydc",
"rack": "myrack"
},
"snapshot": {
"name": "snapshot name",
"created_at": seconds_since_epoch,
"expires_at": seconds_since_epoch | null,
},
"table": {
"keyspace_name": "my_keyspace",
"table_name": "my_table",
"table_id": "<UUID>",
"tablets_type": "none|powof2",
"tablet_count": N
},
"sstables": [
{
"id": "67e35000-d8c6-11f0-9599-060de9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq829wcsddgwha1n-big-TOC.txt",
"data_size": 75,
"index_size": 8,
"first_token": -8629266958227979430,
"last_token": 9168982884335614769,
},
{
"id": "67e35000-d8c6-11f0-85dc-0625e9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq821a6cqlbmxrtn-big-TOC.txt",
"data_size": 73,
"index_size": 8,
"first_token": 221146791717891383,
"last_token": 7354559975791427036,
},
...
],
"files": [ ... ]
}
The `manifest` member contains the following attributes:
- `version` - respresenting the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.
The `node` member contains metadata about this node that enables datacenter- or rack-aware restore.
- `host_id` - is the node's unique host_id (a UUID).
- `datacenter` - is the node's datacenter.
- `rack` - is the node's rack.
The `snapshot` member contains metadata about the snapshot.
- `name` - is the snapshot name (a.k.a. `tag`)
- `created_at` - is the time when the snapshot was created.
- `expires_at` - is an optional time when the snapshot expires and can be dropped, if a TTL was set for the snapshot. If there is no TTL, `expires_at` may be omitted, set to null, or set to 0.
The `table` member contains metadata about the table being snapshot.
- `keyspace_name` and `table_name` - are self-explanatory.
- `table_id` - a universally unique identifier (UUID) of the table set when the table is created.
- `tablets_type`:
- `none` - if the keyspace uses vnodes replication
- `powof2` - if the keyspace uses tables replication, and the tablet token ranges are based on powers of 2.
- `tablet_count` - Optional. If `tablets_type` is not `none`, contains the number of tablets allcated in the table. If `tablets_type` is `powof2`, tablet_count would be a power of 2.
The `sstables` member is a list containing metadata about the SSTables in the snapshot.
- `id` - is the STable's unique id (a UUID). It is carried over with the SSTable when it's streamed as part of tablet migration, even if it gets a new generation.
- `toc_name` - is the name of the SSTable Table Of Contents (TOC) component.
- `data_size` and `index_size` - are the sizes of the SSTable's data and index components, respectively. They can be used to estimate how much disk space is needed for restore.
- `first_token` and `last_token` - are the first and last tokens in the SSTable, respectively. They can be used to determine if a SSTable is fully contained in a (tablet) token range to enable efficient file-based streaming of the SSTable.
The optional `files` member may contain a list of non-SSTable files included in the snapshot directory, not including the manifest.json file and schema.cql.
```
3. `CREATE KEYSPACE` with S3/GS storage
When creating a keyspace with S3/GS storage, the data is stored under the bucket passed as argument to the `CREATE KEYSPACE` statement.

View File

@@ -6,10 +6,10 @@ same amount of disk space. This means that the number of tablets located on a no
proportional to the gross disk capacity of that node. Because the used disk space of
different tablets can vary greatly, this could create imbalance in disk utilization.
Size based load balancing aims to achieve better disk utilization accross nodes in a
Size based load balancing aims to achieve better disk utilization across nodes in a
cluster. The load balancer will continuously gather information about available disk
space and tablet sizes from all the nodes. It then incrementally computes tablet
migration plans which equalize disk utilization accross the cluster.
migration plans which equalize disk utilization across the cluster.
# Basic operation
@@ -75,7 +75,7 @@ migrations), and will wait for correct tablet sizes to arrive after the next ``l
refresh by the topology coordinator.
One exception to this are nodes which have been excluded from the cluster. These nodes
are down and therefor are not able to send fresh ``load_stats``. But they have to be drained
are down and therefore are not able to send fresh ``load_stats``. But they have to be drained
of their tablets (via tablet rebuild), and the balancer must do this even with incomplete
tablet data. So, only excluded nodes are allowed to have missing tablet sizes.

View File

@@ -357,6 +357,7 @@ Schema:
CREATE TABLE system.load_per_node (
node uuid PRIMARY KEY,
dc text,
effective_capacity bigint,
rack text,
storage_allocated_load bigint,
storage_allocated_utilization double,
@@ -372,6 +373,7 @@ Columns:
* `storage_allocated_load` - Disk space allocated for tablets, assuming each tablet has a fixed size (target_tablet_size).
* `storage_allocated_utilization` - Fraction of node's disk capacity taken for `storage_allocated_load`, where 1.0 means full utilization.
* `storage_capacity` - Total disk capacity in bytes. Used to compute `storage_allocated_utilization`. By default equal to file system's capacity.
* `effective_capacity` - Sum of available disk space and tablet sizes on a node. Used to compute load on a node for size based balancing.
* `storage_load` - Disk space allocated for tablets, computed with actual tablet sizes. Can be null if some of the tablet sizes are not known.
* `storage_utilization` - Fraction of node's disk capacity taken for `storage_load` (with actual tablet sizes), where 1.0 means full utilization. Can be null if some of the tablet sizes are not known.
* `tablets_allocated` - Number of tablet replicas on the node. Migrating tablets are accounted as if migration already finished.

View File

@@ -46,14 +46,11 @@ stateDiagram-v2
state replacing {
rp_join_group0 : join_group0
rp_left_token_ring : left_token_ring
rp_tablet_draining : tablet_draining
rp_write_both_read_old : write_both_read_old
rp_write_both_read_new : write_both_read_new
[*] --> rp_join_group0
rp_join_group0 --> rp_left_token_ring: rollback
rp_join_group0 --> rp_tablet_draining
rp_tablet_draining --> rp_write_both_read_old
rp_tablet_draining --> rp_left_token_ring: rollback
rp_join_group0 --> rp_write_both_read_old
rp_join_group0 --> [*]: rejected
rp_write_both_read_old --> rp_write_both_read_new: streaming completed
rp_write_both_read_old --> rp_left_token_ring: rollback
@@ -69,34 +66,34 @@ stateDiagram-v2
normal --> decommissioning: leave
normal --> removing: remove
state decommissioning {
de_tablet_migration0 : tablet_migration (draining)
de_tablet_migration1 : tablet_migration
[*] --> de_tablet_migration0
de_tablet_migration0 --> [*]: aborted
de_left_token_ring : left_token_ring
de_tablet_draining : tablet_draining
de_tablet_migration : tablet_migration
de_write_both_read_old: write_both_read_old
de_write_both_read_new : write_both_read_new
de_rollback_to_normal : rollback_to_normal
[*] --> de_tablet_draining
de_tablet_draining --> de_rollback_to_normal: rollback
de_rollback_to_normal --> de_tablet_migration
de_tablet_draining --> de_write_both_read_old
de_tablet_migration --> [*]
de_rollback_to_normal --> de_tablet_migration1
de_tablet_migration0 --> de_write_both_read_old
de_tablet_migration1 --> [*]
de_write_both_read_old --> de_write_both_read_new: streaming completed
de_write_both_read_old --> de_rollback_to_normal: rollback
de_write_both_read_new --> de_left_token_ring
de_left_token_ring --> [*]
}
state removing {
re_tablet_migration0 : tablet_migration (draining)
re_tablet_migration1 : tablet_migration
[*] --> re_tablet_migration0
re_tablet_migration0 --> [*]: aborted
re_left_token_ring : left_token_ring
re_tablet_draining : tablet_draining
re_tablet_migration : tablet_migration
re_write_both_read_old : write_both_read_old
re_write_both_read_new : write_both_read_new
re_rollback_to_normal : rollback_to_normal
[*] --> re_tablet_draining
re_tablet_draining --> re_rollback_to_normal: rollback
re_rollback_to_normal --> re_tablet_migration
re_tablet_migration --> [*]
re_tablet_draining --> re_write_both_read_old
re_rollback_to_normal --> re_tablet_migration1
re_tablet_migration1 --> [*]
re_tablet_migration0 --> re_write_both_read_old
re_write_both_read_old --> re_write_both_read_new: streaming completed
re_write_both_read_old --> re_rollback_to_normal: rollback
re_write_both_read_new --> re_left_token_ring
@@ -181,6 +178,27 @@ are the currently supported global topology operations:
contain replicas of the table being truncated. It uses [sessions](#Topology guards)
to make sure that no stale RPCs are executed outside of the scope of the request.
## Tablet draining
Presence of node requests of type `leave` (decommission) and `remove` will cause the load
balancer to start migrating tablets away from those nodes. Until they are drained
of tablets, the requests are in paused state and are not picked by topology
coordinator for execution.
Paused requests are also pending, the "topology_request" field is engaged.
The node's state is still `normal` when request is paused.
Canceling the requests will stop the draining process, because absence of a request
lifts the draining state on the node.
When tablet scheduler is done with migrating all tablets away from a draining node,
the associated request will be unpaused, and can be picked by topology coordinator
for execution. This time it will enter the `write_both_read_old` transition and proceed
with the vnode part.
This allows multiple requests to drain tablets in parallel, and other requests to not be blocked
by long `leave` and `remove` requests.
## Zero-token nodes
Zero-token nodes (the nodes started with `join_ring=false`) never own tokens or become
@@ -221,12 +239,6 @@ that there are no tablet transitions in the system.
Tablets are migrated in parallel and independently.
There is a variant of tablet migration track called tablet draining track, which is invoked
as a step of certain topology operations (e.g. decommission, removenode). Its goal is to readjust tablet replicas
so that a given topology change can proceed. For example, when decommissioning a node, we
need to migrate tablet replicas away from the node being decommissioned.
Tablet draining happens before making changes to vnode-based replication.
## Node replace with tablets
Tablet replicas on the replaced node are rebuilt after the replacing node is already in the normal state and

View File

@@ -0,0 +1,23 @@
.. _automatic-repair:
Automatic Repair
================
Traditionally, launching :doc:`repairs </operating-scylla/procedures/maintenance/repair>` in a ScyllaDB cluster is left to an external process, typically done via `Scylla Manager <https://manager.docs.scylladb.com/stable/repair/index.html>`_.
Automatic repair offers built-in scheduling in ScyllaDB itself. If the time since the last repair is greater than the configured repair interval, ScyllaDB will start a repair for the :doc:`tablet table </architecture/tablets>` automatically.
Repairs are spread over time and among nodes and shards, to avoid load spikes or any adverse effects on user workloads.
To enable automatic repair, add this to the configuration (``scylla.yaml``):
.. code-block:: yaml
auto_repair_enabled_default: true
auto_repair_threshold_default_in_seconds: 86400
This will enable automatic repair for all tables with a repair period of 1 day. This configuration has to be set on each node, to an identical value.
More featureful configuration methods will be implemented in the future.
To disable, set ``auto_repair_enabled_default: false``.
Automatic repair relies on :doc:`Incremental Repair </features/incremental-repair>` and as such it only works with :doc:`tablet </architecture/tablets>` tables.

View File

@@ -3,7 +3,7 @@
Incremental Repair
==================
ScyllaDB's standard repair process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
ScyllaDB's standard :doc:`repair </operating-scylla/procedures/maintenance/repair>` process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
The core idea of incremental repair is to repair only the data that has been written or changed since the last repair was run. It intelligently skips data that has already been verified, dramatically reducing the time, I/O, and CPU resources required for the repair operation.
@@ -28,8 +28,7 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
The available modes are:
@@ -38,7 +37,12 @@ The available modes are:
* ``disabled``: Completely disables the incremental repair logic for the current operation. The repair behaves like a classic, non-incremental repair, and it does not read or update any incremental repair status markers.
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental. It can also be specified with the REST API, e.g., curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental.
It can also be specified with the REST API, e.g.:
.. code::
curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
Benefits of Incremental Repair
------------------------------
@@ -47,6 +51,8 @@ Benefits of Incremental Repair
* **Reduced Resource Usage:** Consumes significantly less CPU, I/O, and network bandwidth compared to a full repair.
* **More Frequent Repairs:** The efficiency of incremental repair allows you to run it more frequently, ensuring a higher level of data consistency across your cluster at all times.
Tables using Incremental Repair can schedule repairs in ScyllaDB itself, with :doc:`Automatic Repair </features/automatic-repair>`.
Notes
-----

View File

@@ -17,6 +17,7 @@ This document highlights ScyllaDB's key data modeling features.
Workload Prioritization </features/workload-prioritization>
Backup and Restore </features/backup-and-restore>
Incremental Repair </features/incremental-repair/>
Automatic Repair </features/automatic-repair/>
Vector Search </features/vector-search/>
.. panel-box::
@@ -44,5 +45,7 @@ This document highlights ScyllaDB's key data modeling features.
* :doc:`Incremental Repair </features/incremental-repair/>` provides a much more
efficient and lightweight approach to maintaining data consistency by
repairing only the data that has changed since the last repair.
* :doc:`Automatic Repair </features/automatic-repair/>` schedules and runs repairs
directly in ScyllaDB, without external schedulers.
* :doc:`Vector Search in ScyllaDB </features/vector-search/>` enables
similarity-based queries on vector embeddings.

View File

@@ -86,7 +86,7 @@ Compaction Strategies with Materialized Views
Materialized views, just like regular tables, use one of the available :doc:`compaction strategies </architecture/compaction/compaction-strategies>`.
When a materialized view is created, it does not inherit its base table compaction strategy settings, because the data model
of a view does not necessarily have the same characteristics as the one from its base table.
Instead, the default compaction strategy (SizeTieredCompactionStrategy) is used.
Instead, the default compaction strategy (IncrementalCompactionStrategy) is used.
A compaction strategy for a new materialized view can be explicitly set during its creation, using the following command:

View File

@@ -10,7 +10,6 @@ Install ScyllaDB |CURRENT_VERSION|
/getting-started/install-scylla/launch-on-azure
/getting-started/installation-common/scylla-web-installer
/getting-started/install-scylla/install-on-linux
/getting-started/installation-common/install-jmx
/getting-started/install-scylla/run-in-docker
/getting-started/installation-common/unified-installer
/getting-started/installation-common/air-gapped-install
@@ -36,7 +35,6 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
* :doc:`Install ScyllaDB with Web Installer (recommended) </getting-started/installation-common/scylla-web-installer>`
* :doc:`Install ScyllaDB Linux Packages </getting-started/install-scylla/install-on-linux>`
* :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`
* :doc:`Install ScyllaDB Without root Privileges </getting-started/installation-common/unified-installer>`
* :doc:`Air-gapped Server Installation </getting-started/installation-common/air-gapped-install>`
* :doc:`ScyllaDB Developer Mode </getting-started/installation-common/dev-mod>`

View File

@@ -4,9 +4,9 @@
.. |RHEL_EPEL_8| replace:: https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
.. |RHEL_EPEL_9| replace:: https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
======================================
Install ScyllaDB Linux Packages
======================================
========================================================
Install ScyllaDB |CURRENT_VERSION| Linux Packages
========================================================
We recommend installing ScyllaDB using :doc:`ScyllaDB Web Installer for Linux </getting-started/installation-common/scylla-web-installer/>`,
a platform-agnostic installation script, to install ScyllaDB on any supported Linux platform.
@@ -46,13 +46,13 @@ Install ScyllaDB
.. code-block:: console
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys a43e06657bac99e3
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --export --armor a43e06657bac99e3 | gpg --dearmor > /etc/apt/keyrings/scylladb.gpg
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys c503c686b007f39e
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --export --armor c503c686b007f39e | gpg --dearmor > /etc/apt/keyrings/scylladb.gpg
.. code-block:: console
:substitutions:
sudo wget -O /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/debian/|UBUNTU_SCYLLADB_LIST|
sudo wget -O /etc/apt/sources.list.d/scylla.list https://downloads.scylladb.com/deb/debian/|UBUNTU_SCYLLADB_LIST|
#. Install ScyllaDB packages.
@@ -94,16 +94,6 @@ Install ScyllaDB
apt-get install scylla{,-server,-kernel-conf,-node-exporter,-conf,-python3,-cqlsh}=2025.3.1-0.20250907.2bbf3cf669bb-1
#. (Ubuntu only) Set Java 11.
.. code-block:: console
sudo apt-get update
sudo apt-get install -y openjdk-11-jre-headless
sudo update-java-alternatives --jre-headless -s java-1.11.0-openjdk-amd64
.. group-tab:: Centos/RHEL
#. Install the EPEL repository.
@@ -135,7 +125,7 @@ Install ScyllaDB
.. code-block:: console
:substitutions:
sudo curl -o /etc/yum.repos.d/scylla.repo -L http://downloads.scylladb.com/rpm/centos/|CENTOS_SCYLLADB_REPO|
sudo curl -o /etc/yum.repos.d/scylla.repo -L https://downloads.scylladb.com/rpm/centos/|CENTOS_SCYLLADB_REPO|
#. Install ScyllaDB packages.
@@ -143,27 +133,19 @@ Install ScyllaDB
sudo yum install scylla
Running the command installs the latest official version of ScyllaDB Open Source.
Alternatively, you can to install a specific patch version:
Running the command installs the latest official version of ScyllaDB.
Alternatively, you can install a specific patch version:
.. code-block:: console
sudo yum install scylla-<your patch version>
Example: The following example shows the command to install ScyllaDB 5.2.3.
Example: The following example shows installing ScyllaDB 2025.3.1.
.. code-block:: console
:class: hide-copy-button
sudo yum install scylla-5.2.3
(Optional) Install scylla-jmx
-------------------------------
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, see :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`.
sudo yum install scylla-2025.3.1
.. include:: /getting-started/_common/setup-after-install.rst

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on AWS
==========================
===============================================
Launch ScyllaDB |CURRENT_VERSION| on AWS
===============================================
This article will guide you through self-managed ScyllaDB deployment on AWS. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on Azure
==========================
===============================================
Launch ScyllaDB |CURRENT_VERSION| on Azure
===============================================
This article will guide you through self-managed ScyllaDB deployment on Azure. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on GCP
==========================
=============================================
Launch ScyllaDB |CURRENT_VERSION| on GCP
=============================================
This article will guide you through self-managed ScyllaDB deployment on GCP. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -1,78 +0,0 @@
======================================
Install scylla-jmx Package
======================================
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, you can still install it from scylla-jmx GitHub page.
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Download .deb package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".deb".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code-block:: console
sudo apt install -y ./scylla-jmx_<version>_all.deb
.. group-tab:: Centos/RHEL
#. Download .rpm package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".rpm".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code-block:: console
sudo yum install -y ./scylla-jmx-<version>.noarch.rpm
.. group-tab:: Install without root privileges
#. Download .tar.gz package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".tar.gz".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code:: console
tar xpf scylla-jmx-<version>.noarch.tar.gz
cd scylla-jmx
./install.sh --nonroot
Next Steps
-----------
* :doc:`Configure ScyllaDB </getting-started/system-configuration>`
* Manage your clusters with `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_
* Monitor your cluster and data with `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_
* Get familiar with ScyllaDBs :doc:`command line reference guide </operating-scylla/nodetool>`.
* Learn about ScyllaDB at `ScyllaDB University <https://university.scylladb.com/>`_

View File

@@ -36,11 +36,8 @@ release versions, run:
curl -sSf get.scylladb.com/server | sudo bash -s -- --list-active-releases
Versions 2025.1 and Later
==============================
Run the command with the ``--scylla-version`` option to specify the version
you want to install.
To install a non-default version, run the command with the ``--scylla-version``
option to specify the version you want to install.
**Example**
@@ -50,20 +47,4 @@ you want to install.
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version |CURRENT_VERSION|
Versions Earlier than 2025.1
================================
To install a supported version of *ScyllaDB Enterprise*, run the command with:
* ``--scylla-product scylla-enterprise`` to specify that you want to install
ScyllaDB Entrprise.
* ``--scylla-version`` to specify the version you want to install.
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
.. include:: /getting-started/_common/setup-after-install.rst

View File

@@ -14,44 +14,35 @@ Prerequisites
Ensure your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support </getting-started/os-support>` for information about supported Linux distributions and versions.
Note that if you're on CentOS 7, only root offline installation is supported.
Download and Install
-----------------------
#. Download the latest tar.gz file for ScyllaDB version (x86 or ARM) from ``https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-<version>/``.
Example for version 6.1: https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-6.1/
**Example** for version 2025.1:
- Go to https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-2025.1/
- Download the ``scylla-unified`` file for the patch version you want to
install. For example, to install 2025.1.9 (x86), download
``scylla-unified-2025.1.9-0.20251010.6c539463bbda.x86_64.tar.gz``.
#. Uncompress the downloaded package.
The following example shows the package for ScyllaDB 6.1.1 (x86):
**Example** for version 2025.1.9 (x86) (downloaded in the previous step):
.. code:: console
.. code::
tar xvfz scylla-unified-6.1.1-0.20240814.8d90b817660a.x86_64.tar.gz
tar xvfz scylla-unified-2025.1.9-0.20251010.6c539463bbda.x86_64.tar.gz
#. Install OpenJDK 8 or 11.
The following example shows Java installation on a CentOS-like system:
.. code:: console
sudo yum install -y java-11-openjdk-headless
For root offline installation on Debian-like systems, two additional packages, ``xfsprogs``
and ``mdadm``, should be installed to be used in RAID setup.
#. (Root offline installation only) For root offline installation on Debian-like
systems, two additional packages, ``xfsprogs`` and ``mdadm``, should be
installed to be used in RAID setup.
#. Install ScyllaDB as a user with non-root privileges:
.. code:: console
./install.sh --nonroot --python3 ~/scylladb/python3/bin/python3
#. (Optional) Install scylla-jmx
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, see :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`.
./install.sh --nonroot
Configure and Run ScyllaDB
----------------------------
@@ -81,19 +72,14 @@ Run nodetool:
.. code:: console
~/scylladb/share/cassandra/bin/nodetool status
~/scylladb/bin/nodetool nodetool status
Run cqlsh:
.. code:: console
~/scylladb/share/cassandra/bin/cqlsh
~/scylladb/bin/cqlsh
Run cassandra-stress:
.. code:: console
~/scylladb/share/cassandra/bin/cassandra-stress write
.. note::
@@ -124,7 +110,7 @@ Nonroot install
./install.sh --upgrade --nonroot
.. note:: The installation script does not upgrade scylla-jmx and scylla-tools. You will have to upgrade them separately.
.. note:: The installation script does not upgrade scylla-tools. You will have to upgrade them separately.
Uninstall
===========
@@ -154,4 +140,4 @@ Next Steps
* Manage your clusters with `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_
* Monitor your cluster and data with `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_
* Get familiar with ScyllaDBs :doc:`command line reference guide </operating-scylla/nodetool>`.
* Learn about ScyllaDB at `ScyllaDB University <https://university.scylladb.com/>`_
* Learn about ScyllaDB at `ScyllaDB University <https://university.scylladb.com/>`_

View File

@@ -57,7 +57,6 @@ Knowledge Base
* :doc:`Customizing CPUSET </kb/customizing-cpuset>`
* :doc:`Recreate RAID devices </kb/raid-device>` - How to recreate your RAID devices without running scylla-setup
* :doc:`Configure ScyllaDB Networking with Multiple NIC/IP Combinations </kb/yaml-address>` - examples for setting the different IP addresses in scylla.yaml
* :doc:`Updating the Mode in perftune.yaml After a ScyllaDB Upgrade </kb/perftune-modes-sync>`
* :doc:`Kafka Sink Connector Quickstart </using-scylla/integrations/kafka-connector>`
* :doc:`Kafka Sink Connector Configuration </using-scylla/integrations/sink-config>`

View File

@@ -1,48 +0,0 @@
==============================================================
Updating the Mode in perftune.yaml After a ScyllaDB Upgrade
==============================================================
We improved ScyllaDB's performance by `removing the rx_queues_count from the mode
condition <https://github.com/scylladb/seastar/pull/949>`_. As a result, ScyllaDB operates in
the ``sq_split`` mode instead of the ``mq`` mode (see :doc:`Seastar Perftune </operating-scylla/admin-tools/perftune>` for information about the modes).
If you upgrade from an earlier version of ScyllaDB, your cluster's existing nodes may use the ``mq`` mode,
while new nodes will use the ``sq_split`` mode. As using different modes across one cluster is not recommended,
you should change the configuration to ensure that the ``sq_split`` mode is used on all nodes.
This section describes how to update the `perftune.yaml` file to configure the ``sq_split`` mode on all nodes.
Procedure
------------
The examples below assume that you are using the default locations for storing data and the `scylla.yaml` file,
and that your NIC is ``eth5``.
#. Backup your old configuration.
.. code-block:: console
sudo mv /etc/scylla.d/cpuset.conf /etc/scylla.d/cpuset.conf.old
sudo mv /etc/scylla.d/perftune.yaml /etc/scylla.d/perftune.yaml.old
#. Create a new configuration.
.. code-block:: console
sudo scylla_sysconfig_setup --nic eth5 --homedir /var/lib/scylla --confdir /etc/scylla
A new ``/etc/scylla.d/cpuset.conf`` will be generated on the output.
#. Compare the contents of the newly generated ``/etc/scylla.d/cpuset.conf`` with ``/etc/scylla.d/cpuset.conf.old`` you created in step 1.
- If they are exactly the same, rename ``/etc/scylla.d/perftune.yaml.old`` you created in step 1 back to ``/etc/scylla.d/perftune.yaml`` and continue to the next node.
- If they are different, move on to the next steps.
#. Restart the ``scylla-server`` service.
.. code-block:: console
nodetool drain
sudo systemctl restart scylla-server
#. Wait for the service to become up and running (similarly to how it is done during a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart>`). It may take a considerable amount of time before the node is in the UN state due to resharding.
#. Continue to the next node.

View File

@@ -25,4 +25,8 @@ Port Description Protocol
19142 Native shard-aware transport port (ssl) TCP
====== ============================================ ========
If you're using ScyllaDB Alternator, ensure that the ports configured
for Alternator with the ``alternator_port`` or ``alternator_https_port`` parameter
are open. See :doc:`ScyllaDB Alternator </alternator/alternator>` for details.
.. note:: For ScyllaDB Manager ports, see the `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_ documentation.

View File

@@ -93,6 +93,25 @@ API calls
Cluster tasks are not unregistered from task manager with API calls.
Node operations module
----------------------
There is a module named ``node_ops``, which allows tracking node operations: decommission, removenode, bootstrap, replace, rebuild.
The ``type`` field designates the operation, and is one of:
- ``decommission``
- ``remove node``
- ``bootstrap``
- ``replace``
- ``rebuild``
The ``scope`` and ``kind`` fields are set to ``cluster``.
The ``entity`` field holds the host id of the node which is being operated on, as long as the request
is not finished. In case of the ``replace`` operation, it will hold the host id of the replacing node.
``decommission`` and ``remove node`` tasks are abortable, but only before they finish tablet migration.
Tasks API
---------

View File

@@ -55,7 +55,7 @@ ScyllaDB nodetool cluster repair command supports the following options:
nodetool cluster repair --tablet-tokens 1,10474535988
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
For example:

Some files were not shown because too many files have changed in this diff Show More