As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.
For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
(inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
(inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```
Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)
* Requires backport to 2026.1 since the leak exists since 004c08f525
[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29084
* github.com:scylladb/scylladb:
test/boost/database_test: add test_snapshot_ctl_details_exception_handling
table: get_snapshot_details: fix indentation inside try block
table: per-snapshot get_snapshot_details: fix typo in comment
table: per-snapshot get_snapshot_details: always close lister using try/catch
table: get_snapshot_details: always close lister using deferred_close
(cherry picked from commit f27dc12b7c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#29125
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#28978
(cherry picked from commit 29619e48d7)
Closesscylladb/scylladb#29101
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.
Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.
Fixes: SCYLLADB-1152
Closesscylladb/scylladb#29152
(cherry picked from commit f29525f3a6)
Closesscylladb/scylladb#29172
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
(cherry picked from commit 5573c3b18e)
Closesscylladb/scylladb#29144
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).
We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-790.
Closesscylladb/scylladb#28505
(cherry picked from commit 5b550e94a6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28769
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.
The Verify Org Membership step is hardened in the same way for
defense-in-depth.
Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954
Closesscylladb/scylladb#28935
(cherry picked from commit 977bdd6260)
Closesscylladb/scylladb#28946
Fixes: SCYLLADB-915
Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).
Closesscylladb/scylladb#28887
(cherry picked from commit ef795eda5b)
Closesscylladb/scylladb#28966
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:
1. After creating/removing the blob file that simulates disk pressure,
the tests immediately checked derived state (e.g., "compaction_manager
- Drained") without first confirming the disk space monitor had detected
the utilization change. Fix: explicitly wait for "Reached/Dropped below
critical disk utilization level" right after creating/removing the blob
file, before checking downstream effects.
2. Several tests called `manager.driver_connect()` or omitted reconnection
entirely after `server_restart()` / `server_start()`. The pre-existing
driver session can silently reconnect multiple times, causing subsequent
CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
to ensure all connection pools are established.
3. Some log assertions used marks captured before a restart, so they could
match pre-restart messages or miss messages emitted in the correct post-restart
window. Fix: refresh marks at the right points.
Apart from that, the patch fixes a typo: autotoogle -> autotoggle.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655Closesscylladb/scylladb#28626
(cherry picked from commit 826fd5d6c3)
Closesscylladb/scylladb#28967
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness. This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.
Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.
Fixes SCYLLADB-865
Closesscylladb/scylladb#28945
(cherry picked from commit e0483f6001)
Closesscylladb/scylladb#28969
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:
seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
(inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
(inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
(inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
(inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
(inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
(inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
(inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
(inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
(inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
(inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
(inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
(inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
(inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
(inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
(inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
(inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
(inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
(inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318
dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.
Fixes: SCYLLADB-121
Closesscylladb/scylladb#28855
(cherry picked from commit 13ff9c4394)
Closesscylladb/scylladb#28975
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.
However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".
Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.
Fixes: SCYLLADB-1014
Refs: SCYLLADB-163
Closesscylladb/scylladb#28982
(cherry picked from commit 9247dff8c2)
Closesscylladb/scylladb#28988
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.
However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.
Wait until repair is done for all tablets in the table.
Fixes: https://github.com/scylladb/scylladb/issues/28202
Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94Closesscylladb/scylladb#28323
* github.com:scylladb/scylladb:
service: fix indentation
test: add test_tablet_repair_wait
service: remove status_helper::tablets
service: tasks: scan all tablets in tablet_virtual_task::wait
(cherry picked from commit 3fed6f9eff)
Closesscylladb/scylladb#28991
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.
Closes: QAINFRA-42
Closesscylladb/scylladb#28458
(cherry picked from commit cee44716db)
Closesscylladb/scylladb#28995
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixes: https://github.com/scylladb/scylladb/issues/27657
Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.
Closesscylladb/scylladb#28952
* github.com:scylladb/scylladb:
transport/messages: hold pinned prepared entry in PREPARE result
cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
(cherry picked from commit d9a277453e)
Closesscylladb/scylladb#29001
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.
The limiter was introduced in cc9a2ad41f
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29050
(cherry picked from commit 8b1ca6dcd6)
Closesscylladb/scylladb#29107
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
Keep token_metadata_ptr in get_children to prevent topology from changing.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867
Needs backports to all versions
Closesscylladb/scylladb#29035
* github.com:scylladb/scylladb:
tasks: fix indentation
tasks: do not fail the wait request if rpc fails
tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
(cherry picked from commit 2e47fd9f56)
Closesscylladb/scylladb#29126
This PR fixes the Installation page:
- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).
Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087
This update affects all supported versions and should be backported as a bug fix.
Closesscylladb/scylladb#29088
* github.com:scylladb/scylladb:
doc: remove the Open Source Example from Installation
doc: replace http with https in the installation instructions
(cherry picked from commit e8b37d1a89)
Closesscylladb/scylladb#29135
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.
Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.
This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103
Backport to all supported branches (other than 2026.1), as the test can fail
there.
Closesscylladb/scylladb#29108
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.
Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.
For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages
While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1
Fixes: SCYLLADB-829
Closesscylladb/scylladb#28587
* github.com:scylladb/scylladb:
load_stats: add filtering for tablet sizes
load_stats: move tablet filtering for table size computation
load_stats: bring the comment and code in sync
(cherry picked from commit 518470e89e)
Closesscylladb/scylladb#29034
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.
Closesscylladb/scylladb#28949
(cherry picked from commit 40d180a7ef)
Closesscylladb/scylladb#29069
This patch is mostly for the purpose of running pgo CI job.
We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.
In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build
Closesscylladb/scylladb#29063
* github.com:scylladb/scylladb:
perf: add abort_source support to wait-for-port loops
perf-alternator: wait for alternator port before running workload
(cherry picked from commit 172c786079)
Closesscylladb/scylladb#29098
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).
Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.
A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913Closesscylladb/scylladb#28998
(cherry picked from commit 526e5986fe)
Closesscylladb/scylladb#29068
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.
This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.
This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.
Fixes: VECTOR-528
Backports to 2025.04 and 2026.01 are required, as these branches are also affected.
Closesscylladb/scylladb#28637
* github.com:scylladb/scylladb:
vector_search: fix TLS server name with IP
vector_search: add warn log for failed ann requests
(cherry picked from commit 23ed0d4df8)
Closesscylladb/scylladb#28964
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.
Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.
Fixes: scylladb/scylladb#28925Closesscylladb/scylladb#28928
(cherry picked from commit 42d70baad3)
Closesscylladb/scylladb#28968
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: SCYLLADB-996
- (cherry picked from commit 01ddc17ab9)
Parent PR: #28839Closesscylladb/scylladb#28977
* github.com:scylladb/scylladb:
Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
mv: remove dead code in view_updates::can_skip_view_updates
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
(cherry picked from commit 2e68f48068)
Closesscylladb/scylladb#29006
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.
Fixes: SCYLLADB-1041
Closesscylladb/scylladb#29010
(cherry picked from commit 15cfa5beeb)
Closesscylladb/scylladb#29024
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
# Complete CQL handshake
await do_cql_handshake(reader, writer)
# Now query system.clients using the driver to see our connection
cql = manager.get_cql()
rows = list(cql.execute(
f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
))
# We should find our connection with the fake source address and port
> assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E assert 0 > 0
E + where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
that polls via cql.run_async() until the expected row count is reached
(30 s timeout, 100 ms period).
Fixes: SCYLLADB-819
The flaky test is present on master and in previous release, so backporting only there.
Closesscylladb/scylladb#28849
* github.com:scylladb/scylladb:
test_proxy_protocol: introduce extra logging to aid debugging
test_proxy_protocol: fix flaky system.clients visibility checks
(cherry picked from commit 4150c62f29)
Closesscylladb/scylladb#28951
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808Closesscylladb/scylladb#28839
* github.com:scylladb/scylladb:
mv: allow skipping view updates when a collection is unmodified
mv: allow skipping view updates if an empty collection remains unset
(cherry picked from commit 01ddc17ab9)
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column
In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.
It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.
(cherry picked from commit ca1c8ff209)
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
Needs backport to all live releases as all are vulnerable
Closesscylladb/scylladb#28876
* github.com:scylladb/scylladb:
test: add test_group0_tables_use_schema_commitlog
db: service: remove group0 tables from schema commitlog schema initializer
service: ensure that tables updated via group0 use schema commitlog
db: schema: remove set_is_group0_table param
(cherry picked from commit b90fe19a42)
Closesscylladb/scylladb#28916
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365Closesscylladb/scylladb#28823
* github.com:scylladb/scylladb:
repair: Fix rwlock in compaction_state and lock holder lifecycle
repair: Prevent repair lock holder leakage after table drop
(cherry picked from commit 509f2af8db)
Closesscylladb/scylladb#28934
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
We also raise an internal error to prevent a segmentation fault in a few places.
Fixes#27987
Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.
- (cherry picked from commit e21ecf69de)
- (cherry picked from commit 8e9c7397c5)
Parent PR: #28558
Manually cherry-picked 2a3476094e.
Closesscylladb/scylladb#28735
* github.com:scylladb/scylladb:
storage_service: raft_topology_cmd_handler: fix use-after-free
raft topology: prevent accessing nullptr returned by topology::find
raft topology: make some assertions non-crashing
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877Closesscylladb/scylladb#28748
(cherry picked from commit 4c4673e8f9)
Closesscylladb/scylladb#28907
This commit updates the documentation for the unified installer.
- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).
Fixes https://github.com/scylladb/scylladb/issues/28150Closesscylladb/scylladb#28152
(cherry picked from commit 855c503c63)
Closesscylladb/scylladb#28910
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.
Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.
Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)
Backport to 2026.1, as the test is also there.
[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28877
* github.com:scylladb/scylladb:
test: increase timeouts in test_client_routes.py
test: add missing awaits in test_client_routes_upgrade
(cherry picked from commit 9697b6013f)
Closesscylladb/scylladb#28896
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```
This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.
No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.
Closesscylladb/scylladb#28772
(cherry picked from commit 2a3476094e)
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
The fix is now cherry-picked to master as sometimes
test fails there too.
(cherry picked from commit 1f1fc2c2ac)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795
backport: 2026.1, already on other stable branches
Closesscylladb/scylladb#28848
* github.com:scylladb/scylladb:
test: add more logs to test_startup_no_auth_response
test: decrease strain in test_startup_response
(cherry picked from commit f156bcddab)
Closesscylladb/scylladb#28878
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826
Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.
Closesscylladb/scylladb#28846
* github.com:scylladb/scylladb:
vector_search: test: include ANN error in assertion
vector_search: test: fix HTTPS client test flakiness
(cherry picked from commit 2fb981413a)
Closesscylladb/scylladb#28879
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.
Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).
Fixes SCYLLADB-777.
- (cherry picked from commit 9ab3d5b946)
- (cherry picked from commit 0c7f499750)
- (cherry picked from commit e463d528fe)
Parent PR: #28771Closesscylladb/scylladb#28803
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
(cherry picked from commit a83ee6cf66)
Closesscylladb/scylladb#28853
This commit adds the upgrade guide for version 2026.1.
According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.
In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.
Fixes https://github.com/scylladb/scylladb/issues/28533
Fixes https://github.com/scylladb/scylladb/issues/28532Closesscylladb/scylladb#28789
(cherry picked from commit dfd46ad3fb)
Closesscylladb/scylladb#28835
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.
Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.
Fixes: scylladb/scylla-operator#3080Closesscylladb/scylladb#28567
(cherry picked from commit b4f0eb666f)
Closesscylladb/scylladb#28845
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closesscylladb/scylladb#28829
(cherry picked from commit ba7f314cdc)
Closesscylladb/scylladb#28850
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.
Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)
Fixes: SCYLLADB-721
Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.
Closesscylladb/scylladb#28805
(cherry picked from commit 822c1597c9)
Closesscylladb/scylladb#28836
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: SCYLLADB-105
Closesscylladb/scylladb#28080
(cherry picked from commit b637e17b19)
Closesscylladb/scylladb#28725
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610Closesscylladb/scylladb#28801
(cherry picked from commit bb57b0f3b7)
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.
The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)
Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR
Closesscylladb/scylladb#28804
(cherry picked from commit b211590bc0)
Closesscylladb/scylladb#28812
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.
Fixes SCYLLADB-778
Closesscylladb/scylladb#28781
(cherry picked from commit e2333a57ad)
Closesscylladb/scylladb#28795
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204Closesscylladb/scylladb#28746
(cherry picked from commit cd4caed3d3)
Closesscylladb/scylladb#28794
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.
- (cherry picked from commit af0889d194)
- (cherry picked from commit 4e32502bb3)
Parent PR: #28609Closesscylladb/scylladb#28775
* github.com:scylladb/scylladb:
test/vector_search: add reproducer for rescoring with zero vectors
vector_search: return NaN for similarity_cosine with all-zero vectors
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.
This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit e463d528fe)
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:
1. Even though Alternator TTL splits the work of scanning and expiring
items between nodes, all the items get correctly expired.
2. When one node is down, all the items still expire because the
"secondary" owner of each token range takes over expiring the
items in this range while the "primary" owner is down.
This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0c7f499750)
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.
The next two patches will have two tests that fail before this patch,
and pass with it:
1. A unit test that checks that get_secondary_replica() returns a
different node than get_primary_replica().
2. An Alternator TTL test that checks that when a node is down,
expirations still happen because the secondary replica takes over
the primary replica's work.
Fixes SCYLLADB-777
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9ab3d5b946)
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
- (cherry picked from commit 289e910cec)
- (cherry picked from commit 6280cb91ca)
- (cherry picked from commit 960adbb439)
Parent PR: #28592Closesscylladb/scylladb#28697
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Add reproducer for the SCYLLADB-456 issue following exception
on ANN vector queries with rescoring with similarity cosine.
(cherry picked from commit 4e32502bb3)
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
(cherry picked from commit af0889d194)
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.
This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.
Refs: SCYLLADB-678
Closesscylladb/scylladb#28703
(cherry picked from commit f1bc17bd4c)
Closesscylladb/scylladb#28729
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
Only 2026.1 vulnerable.
- (cherry picked from commit e14eca46af)
- (cherry picked from commit 2454de4f8f)
- (cherry picked from commit d33d38139f)
Parent PR: #28688Closesscylladb/scylladb#28750
* github.com:scylladb/scylladb:
test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
test: cluster: task_manager_client: Introduce wait_task_appears()
tests: pylib: util: Add exponential backoff to wait_for
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
(cherry picked from commit d33d38139f)
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.
(cherry picked from commit e14eca46af)
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
(cherry picked from commit e21ecf69de)
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471Closesscylladb/scylladb#28615
(cherry picked from commit 668d6fe019)
Closesscylladb/scylladb#28690
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.
Skip such repairs at the scheduler level for auto repair.
If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.
Fixes SCYLLADB-561
Closesscylladb/scylladb#28640
(cherry picked from commit 1be80c9e86)
Closesscylladb/scylladb#28714
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd
- (cherry picked from commit 0376d16ad3)
- (cherry picked from commit 3b98451776)
Parent PR: #28530Closesscylladb/scylladb#28716
* github.com:scylladb/scylladb:
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node.
After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets.
This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets.
Fixes: SCYLLADB-603
The test is present in master and 2026.1, so we need to backport this.
- (cherry picked from commit 4ca40929ef)
Parent PR: #28598Closesscylladb/scylladb#28638
* github.com:scylladb/scylladb:
test/cluster: Remove short_tablet_stats_refresh_interval injection
test: add read barrier to test_balance_empty_tablets
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
(cherry picked from commit 0376d16ad3)
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.
This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.
Fixes https://github.com/scylladb/scylladb/issues/28183
LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.
- (cherry picked from commit f89a8c4ec4)
- (cherry picked from commit 9baaddb613)
Parent PR: #28230Closesscylladb/scylladb#28508
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
cql3/statements/describe_statement: hide paxos state tables
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.
Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031
The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).
- (cherry picked from commit 10e9672852)
- (cherry picked from commit d28c841fa9)
- (cherry picked from commit 29da20744a)
Parent PR: #28367Closesscylladb/scylladb#28612
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous patch
raft topology: generate notification about released nodes only once
raft topology: extract "released" nodes calculation to external function
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.
Add the 5th node to the cluster. This way the majority is always up.
Fixes: https://github.com/scylladb/scylladb/issues/28596.
Closesscylladb/scylladb#28610
(cherry picked from commit f955a90309)
Closesscylladb/scylladb#28639
Fixes#28398Fixes#28399
When used as path elements in google storage paths, the object names need to be URL encoded. Due to
a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,
this was missed.
Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.
"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
Adds handling for this.
Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.
- (cherry picked from commit a896d8d5e3)
- (cherry picked from commit 87aa6c8387)
Parent PR: #28400Closesscylladb/scylladb#28685
* github.com:scylladb/scylladb:
utils/gcp/object_storage: URL-encode object names in URL:s
utils::gcp::object_storage: Fix list object pager end condition detection
Fixes#28678
If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.
Simply move queue abort to always be done on loop exit.
Closesscylladb/scylladb#28679
(cherry picked from commit ab4e4a8ac7)
Closesscylladb/scylladb#28693
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
(cherry picked from commit 6280cb91ca)
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.
This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
(cherry picked from commit 289e910cec)
Fixes#28398
When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.
Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.
(cherry picked from commit 87aa6c8387)
Fixes#28399
When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.
Need to handle.
(cherry picked from commit a896d8d5e3)
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.
The PR introduces two changes:
* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.
Fixes: #28012
Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.
- (cherry picked from commit aef5ff7491)
- (cherry picked from commit 079fe17e8b)
Parent PR: #28617Closesscylladb/scylladb#28643
* github.com:scylladb/scylladb:
vector_search: Fix missing timeout on TLS handshake
vector_search: test: Fix flaky cert rewrite test
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).
- (cherry picked from commit e63cfc38b3)
- (cherry picked from commit 9ffa62a986)
Parent PR: #28625Closesscylladb/scylladb#28667
* https://github.com/scylladb/scylladb:
test: explicitly set compression algorithm in test_autoretrain_dict
test: remove unneeded semicolons from python test
The test can currently fail like this:
```
> await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
because node A performs tablet migrations at the the same time.
What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.
The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.
Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.
The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.
Fixes#25938Closesscylladb/scylladb#28632
(cherry picked from commit 0693091aff)
Closesscylladb/scylladb#28673
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.
This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).
Use the config option. This reduces the execution time to ~8 seconds.
Fixes SCYLLADB-556.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28536
(cherry picked from commit 5d1e6243af)
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.
After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.
This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.
Fixes: SCYLLADB-603
Closesscylladb/scylladb#28598
(cherry picked from commit 4ca40929ef)
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: scylladb/scylladb#28204
(cherry picked from commit 9ffa62a986)
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
---
Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.
Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.
---
The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):
Before:
CPU utilization: 0.9%
real 2m7.811s
user 0m25.446s
sys 0m16.733s
After:
CPU utilization: 1.1%
real 1m40.288s
user 0m25.218s
sys 0m16.566s
---
Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203
Backport: This improves the stability of our CI, so let's
backport it to all supported versions.
- (cherry picked from commit 628e74f157)
- (cherry picked from commit ac4af5f461)
- (cherry picked from commit c5239edf2a)
- (cherry picked from commit a256ba7de0)
- (cherry picked from commit f83f911bae)
Parent PR: #28602Closesscylladb/scylladb#28623
* github.com:scylladb/scylladb:
test: cluster: Reduce wait time in test_sync_point
test: cluster: Fix test_sync_point
test: cluster: Await sync points asynchronously
test: cluster: Create sync points asynchronously
test: cluster: Fetch hint metrics asynchronously
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.
This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
(cherry picked from commit 079fe17e8b)
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.
This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.
Fixes: #28012
(cherry picked from commit aef5ff7491)
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
(cherry picked from commit f83f911bae)
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
(cherry picked from commit a256ba7de0)
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
(cherry picked from commit c5239edf2a)
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
(cherry picked from commit ac4af5f461)
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
(cherry picked from commit 628e74f157)
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.
Fixes: #28301
Refs: #25031
(cherry picked from commit d28c841fa9)
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
(cherry picked from commit 10e9672852)
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.
There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.
It needs a backport to 2026.1 to remove remaining validator tests from there.
Fixes: VECTOR-497
Closesscylladb/scylladb#28568
(cherry picked from commit 81d11a23ce)
Closesscylladb/scylladb#28577
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.
Fixes: SCYLLADB-522
Closesscylladb/scylladb#28519
(cherry picked from commit 6b9fcc6ca3)
Closesscylladb/scylladb#28538
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`.
We also extend `test_maintenance_mode` to provide a reproducer for
Fixes#27988
This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.
- (cherry picked from commit a08c53ae4b)
- (cherry picked from commit 9d4a5ade08)
- (cherry picked from commit c92962ca45)
- (cherry picked from commit 408c6ea3ee)
- (cherry picked from commit 53f58b85b7)
- (cherry picked from commit 867a1ca346)
- (cherry picked from commit 6c547e1692)
- (cherry picked from commit 7e7b9977c5)
Parent PR: #28322Closesscylladb/scylladb#28499
* https://github.com/scylladb/scylladb:
test: test_maintenance_mode: enable maintenance mode properly
test: test_maintenance_mode: shutdown cluster connections
test: test_maintenance_mode: run with different keyspace options
test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
test: test_maintenance_mode: get rid of the conditional skip
test: test_maintenance_mode: remove the redundant value from the query result
storage_proxy: skip validate_read_replica in maintenance mode
storage_service: set up topology properly in maintenance mode
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.
The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).
The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.
Refs #28183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9baaddb613)
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.
This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.
Fixesscylladb/scylladb#28183
(cherry picked from commit f89a8c4ec4)
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.
(cherry picked from commit 7e7b9977c5)
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.
The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
(cherry picked from commit 867a1ca346)
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.
Additionally, we check if the error message is as expected.
(cherry picked from commit 53f58b85b7)
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.
We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).
It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
(cherry picked from commit 408c6ea3ee)
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.
As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.
We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
dropped by the compiler in non-debug builds.
We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
(cherry picked from commit 9d4a5ade08)
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.
Fixes#27988
(cherry picked from commit a08c53ae4b)
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.
During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.
This fixes this problem by checking if the table still exists.
Fixes: #28359
- (cherry picked from commit 71be10b8d6)
- (cherry picked from commit 92dbde54a5)
Parent PR: #28440Closesscylladb/scylladb#28471
* github.com:scylladb/scylladb:
test: add test and reproducer for load_stats refresh exception
load_stats: handle dropped tables when refreshing load_stats
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
(cherry picked from commit 92dbde54a5)
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.
During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.
This patch fixes this problem by checking if the table still exists.
(cherry picked from commit 71be10b8d6)
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.
Fixes: SCYLLADB-130
This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.
- (cherry picked from commit a84b1b8b78)
- (cherry picked from commit 57b2cd2c16)
- (cherry picked from commit 1713d75c0d)
Parent PR: #28199Closesscylladb/scylladb#28424
* github.com:scylladb/scylladb:
docs: add feature page for automatic repair
docs: inter-link incremental-repair and repair documents
docs: incremental-repair: fix curl example
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.
Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.
Fixes#28210Fixes#28211
Flaky tests are only in master and branch-2026.1, so backporting there.
- (cherry picked from commit ebac810c4e)
- (cherry picked from commit 59f2a3ce72)
Parent PR: #28291Closesscylladb/scylladb#28443
* github.com:scylladb/scylladb:
test: test_alternator_proxy_protocol: wait for the node to report itself as serving
test: cluster_manager: add ability to wait for supervisor STATUS=serving
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.
Most notably:
- Make plan summary more concise, and print info only about present elements.
- Print rack name in addition to DC name when making a per-rack plan
- Print "Not possible to achieve balance" only when this is the final plan with no active migrations
- Print per-node stats when "Not possible to achieve balance" is printed
- amortize metrics lookup cost
- avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"
Backport to 2026.1: since the changes enhance debuggability and are relatively low risk
Fixes#28423Fixes#28422
- (cherry picked from commit 32b336e062)
- (cherry picked from commit df32318f66)
- (cherry picked from commit f2b0146f0f)
- (cherry picked from commit 0d090aa47b)
- (cherry picked from commit 12fdd205d6)
- (cherry picked from commit 615b86e88b)
- (cherry picked from commit 7228bd1502)
- (cherry picked from commit 4a161bff2d)
- (cherry picked from commit ef0e9ad34a)
- (cherry picked from commit 9715965d0c)
- (cherry picked from commit 8e831a7b6d)
Parent PR: #28337Closesscylladb/scylladb#28428
* github.com:scylladb/scylladb:
tablets: tablet_allocator.cc: Convert tabs to spaces
tablets: load_balancer: Warn about incomplete stats once for all offending nodes
tablets: load_balancer: Improve node stats printout
tablets: load_balancer: Warn about imbalance only when there are no more active migrations
tablets: load_balancer: Extract print_node_stats()
tablet: load_balancer: Use empty() instead of size() where applicable
tablets: Fix redundancy in migration_plan::empty()
tablets: Cache pointer to stats during plan-making
tablets: load_balancer: Print rack in addition to DC when giving context
tablets: load_balancer: Make plan summary concise
tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.
This patch allows creating vector indexes using this CQL syntax:
```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
discussion_board_id int,
country text,
lang text,
PRIMARY KEY ((commenter, discussion_board_id), created_at)
);
CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```
Currently, if we run these queries to create indexes we will receive such errors:
```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```
This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.
Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).
This commits adds cqlpy test to check errors while creating indexes.
Fixes: SCYLLADB-298
This needs to be backported to version 2026.1 as this is a fix for filtering support.
Closesscylladb/scylladb#28366
(cherry picked from commit f49c9e896a)
Closesscylladb/scylladb#28448
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
- (cherry picked from commit bd9d5ad75b)
- (cherry picked from commit 359d0b7a3e)
- (cherry picked from commit ce0c7b5896)
- (cherry picked from commit 5b3e513cba)
- (cherry picked from commit 66a33619da)
- (cherry picked from commit 6eb7dba352)
- (cherry picked from commit a05a4593a6)
- (cherry picked from commit 3a31380b2c)
- (cherry picked from commit 912c48a806)
Parent PR: #27891Closesscylladb/scylladb#28405
* https://github.com/scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
(cherry picked from commit 59f2a3ce72)
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
(cherry picked from commit ebac810c4e)
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.
Fixes#28385Closesscylladb/scylladb#28388
(cherry picked from commit a2c1569e04)
Closesscylladb/scylladb#28417
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.
Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
(cherry picked from commit 4a161bff2d)
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.
Also, lays the ground for splitting stats per rack.
(cherry picked from commit 0d090aa47b)
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
(cherry picked from commit f2b0146f0f)
Before:
load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)
After:
load_balancer - Prepared plan: migrations: 1
We print only stats about elements which are present.
(cherry picked from commit df32318f66)
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
(cherry picked from commit 1713d75c0d)
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
(cherry picked from commit 57b2cd2c16)
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
Closesscylladb/scylladb#28240
(cherry picked from commit cb2aa85cf5)
Closesscylladb/scylladb#28345
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.
(cherry picked from commit 3a31380b2c)
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.
(cherry picked from commit a05a4593a6)
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.
(cherry picked from commit 6eb7dba352)
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
(cherry picked from commit 66a33619da)
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call
(cherry picked from commit bd9d5ad75b)
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).
Ref #18066.
Closesscylladb/scylladb#28336
(cherry picked from commit ec70cea2a1)
Closesscylladb/scylladb#28365
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.
Using newer seastar facilities, not backporting.
Closesscylladb/scylladb#28321
* github.com:scylladb/scylladb:
sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28224
This series implements rescoring algorithm.
Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.
When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.
Example:
Creating a Vector Index with Rescoring
```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
id int PRIMARY KEY,
embedding vector<float, 128>
);
-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
USING 'vector_index'
WITH OPTIONS = {
'similarity_function': 'cosine',
'quantization': 'i8',
'oversampling': '2.0',
'rescoring': 'true'
};
```
1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results
Query example:
```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```
With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results
In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.
Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
New feature - doesn't need backport.
Closesscylladb/scylladb#27769
* github.com:scylladb/scylladb:
vector_index: rescoring: Fetch oversampled rows
vector_index: rescoring: Sort by similarity column
select_statement: Modify `needs_post_query_ordering` condition
vector_index: rescoring: Add hidden similarity score column
vector_index: Refactor extracting ANN query information
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.
Fixesscylladb/scylladb#28260
backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version
Closesscylladb/scylladb#28315
* https://github.com/scylladb/scylladb:
storage_proxy: drop stop() method
test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.
The test never expects a timeout, so increasing it won't increase
the test duration.
Fixes#28028Closesscylladb/scylladb#28272
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
to `test_sstable_index_cache` so both tests are collected
Refs https://github.com/scylladb/scylladb/issues/22501Closesscylladb/scylladb#28286
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Fixes https://github.com/scylladb/scylladb/issues/28214
backport to 2025.4 to add RF-rack-valid enforcements in alternator
Closesscylladb/scylladb#28154
* github.com:scylladb/scylladb:
locator: document the exception type of assert_rf_rack_valid_keyspace
alternator: don't require rf_rack flag for indexes, validate instead
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.
This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.
The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)
and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.
Fixes: scylladb/scylladb#27893
A minor change, no need to backport.
Closesscylladb/scylladb#27894
* github.com:scylladb/scylladb:
db: fail reads and writes with local consistencty level to a DC with RF=0
db: consistency_level: split `local_quorum_for()`
db: consistency_level: fix nrs -> nts abbreviation
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.
The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.
Fixesscylladb/scylladb#28260
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.
No backport needed, only framework enhancement.
Closesscylladb/scylladb#28156
* github.com:scylladb/scylladb:
test.py: do not crash when there is no boost log
test.py: pass correctly extra cmd line arguments
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).
To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.
Fixes: scylladb/scylladb#27715Closesscylladb/scylladb#28318
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closesscylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.
Fixes: SCYLLADB-371
Closesscylladb/scylladb#28312
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.
With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.
However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.
In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.
In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.
Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.
In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.
Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.
Fix: scylladb/scylla-enterprise#5692Closesscylladb/scylladb#28300
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
Fixes#21452Closesscylladb/scylladb#24129
* https://github.com/scylladb/scylladb:
docs: Document parallel decommission and removenode and relevant task API
test: Add tests for parallel decommission/removenode
test: util: Introduce ensure_group0_leader_on()
test: tablets: Check that there are no migrations scheduled on draining nodes
test: lib: topology_builder: Introduce add_draining_request()
topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
tablets: topology_coordinator: Refactor to propagate reason for migration rollback
tablet_allocator: Skip co-location on draining nodes
node_ops: task_manager_module: Populate entity field also for active requests
tasks: node_ops: Put node id in the entity field
tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
topology: Protect against empty cancelation reason
tasks, topology: Make pending node operations abortable
doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
raft_topology, tablets: Drain tablets in parallel with other topology operations
virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
locator: topology: Add "draining" flag to a node
topology_coordinator: Extract generate_cancel_request_update()
storage_service: Drop dependency in topology_state_machine.hh in the header
locator: Extract common code in assert_rf_rack_valid_keyspace()
topology_coordinator, storage_service: Validate node removal/decommission at request submission time
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.
This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.
We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.
The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.
Fixes#27862
No backport needed as 2e7ba1f8ce is only
in master.
Closesscylladb/scylladb#28290
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Mark rf_rack_valid_keyspaces as deprecated.
Fixes: https://github.com/scylladb/scylladb/issues/26399.
New feature; no backport needed
Closesscylladb/scylladb#28084
* github.com:scylladb/scylladb:
test: add test for enforce_rack_list option
db: mark rf_rack_valid_keyspaces as deprecated
config: add enforce_rack_list option
Revert "alternator: require rf_rack_valid_keyspaces when creating index"
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.
Fixes: SCYLLADB-196
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.
For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables. In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.
Fixes SCYLLADB-195
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.
Fixes#28227Closesscylladb/scylladb#28233
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).
This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).
This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.
Fixes#26914.
Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.
Closesscylladb/scylladb#27204
* github.com:scylladb/scylladb:
db/config: Update sstable_compression_user_table_options description
schema: Add initializer for compression defaults
schema: Generalize static configurators into schema initializers
schema: Initialize static properties eagerly
db: config: Add accessor for sstable_compression_user_table_options
test: Check that CQL and Alternator tables respect compression config
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.
The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.
We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.
Fixes#28253
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#28274
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.
Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.
This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.
redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.
is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().
Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.
The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.
We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.
We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.
We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.
This commit adds a function that returns the maximum timestamp for a
given tablet.
Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
the maximum of the contributing sstables.
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.
This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.
The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
This is a refactoring/simplification commit.
There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.
The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
This patch changes the layout of user-facing scheduling groups from
/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
into
/
`- user (supergroup)
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).
The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.
The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.
Unit tests keep their user groups under root supergroup (for simplicity)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28235
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.
Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.
Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.
Fixes: https://github.com/scylladb/scylladb/issues/27759
No topology_guard in any version; needs backport to all versions
Closesscylladb/scylladb#27839
* github.com:scylladb/scylladb:
service: use session variable for streaming
service: pass topology guard to RBNO
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).
This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#28280
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.
Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
No backporting - it is a new feature.
Closesscylladb/scylladb#28165
* github.com:scylladb/scylladb:
vector_search: Add more rescoring validation tests
vector_search: Add rescoring validation test
vector_search: doc: Document new index option
vector_search: test: Add `rescoring` index option test
vector_index: introduce rescoring option
vector_index: improve options validation
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.
Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.
Fixes#28275Closesscylladb/scylladb#28281
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.
Fixes: https://github.com/scylladb/scylladb/issues/27577
A minor enhancement, no need to backport.
Closesscylladb/scylladb#25395
* github.com:scylladb/scylladb:
auth: use paged internal queries during migration
auth: move some code in migrate_to_auth_v2 up
auth: re-align pieces of migrate_to_auth_v2
cql: extend `query_internal` with `query_state` param
This reverts commit c8cff94a5a.
Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28120
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.
Closesscylladb/scylladb#28231
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.
As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.
First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.
If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.
Adjust tests accordingly to use prepared filters.
Follow-up: #28109
Fixes: SCYLLADB-299
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.
Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.
Follow-up: #28109
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.
A reproducer is included which fails before and passes after the fix.
Fixes: #26847Closesscylladb/scylladb#28163
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.
Closesscylladb/scylladb#28228
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.
Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.
Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.
Fixes: https://github.com/scylladb/scylladb/issues/27759
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.
The option can still be used and it does not change it's behavior.
Docs is updated accordingly.
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Currently the suite generates config in old format, and only a single
test validates that using new format "works".
This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28113
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28198
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.
While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.
Closesscylladb/scylladb#28258
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
node A is missing data, digest mismatches and data reconciliation
is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
failing the entire read query
When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...
> await finish_writes()
test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
worker_id = 1
...
> rows = await cql.run_async(rd_stmt, [pk])
E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```
Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.
However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.
This change:
- Retries the verification read in start_writes() on failure to mitigate
races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs
Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529
Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125Closesscylladb/scylladb#28140
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.
Fixes: scylladb/scylladb#27577
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.
In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.
`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
New feature - no backporting
Closesscylladb/scylladb#27677
* github.com:scylladb/scylladb:
vector_search: doc: Document new index options
vector_search: test: Test oversampling
vector_search: test: Add rescoring index options test
vector_search: test: Extract Configure utility to shared header
vector_index: introduce `quantization` and `oversampling` options
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.
As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.
Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.
Code cleanup, no backport
Closesscylladb/scylladb#28146
* github.com:scylladb/scylladb:
db/system_keyspace: move remining tables out of v3 keyspace
db/system_keyspace: relocate truncated() and commitlog_cleanups()
db/system_keyspace: drop v3::local()
db/system_keyspace: remove duplicate table names from v3
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.
Size based load balancing is currently only on master, so no backport is needed.
Closesscylladb/scylladb#28220
* github.com:scylladb/scylladb:
docs: add effective_capacity to system keyspace docs
virtual_table: add effective_capacity to load_per_node
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.
dtest import, no backport needed
Closesscylladb/scylladb#28085
* github.com:scylladb/scylladb:
test/cluster/dtest: remove is_win() and users
test/cluster/dtest/scrub_test.py: add license blurb
test/cluster/dtest: import scrub_test.py
test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).
scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.
In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.
Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.
reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.
evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.
Closesscylladb/scylladb#27774
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.
I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.
We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
if "USE" was in effect, it will return the correct results.
I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.
Fixes#26291
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27973
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.
Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.
Fixes SCYLLADB-252
Closesscylladb/scylladb#28139
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.
One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.
Closesscylladb/scylladb#28170
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary
Closesscylladb/scylladb#28143
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.
Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.
Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
reader_concurrency_semaphore::signal()
* all places where base resources are released now call
release_base_resources()
A reproducer unit test is added, which fails before and passes after the
fix.
Fixes: #28083Closesscylladb/scylladb#28155
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.
This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.
Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.
These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.
However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.
Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.
Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:
1. On a successful conditional write, Cassandra returns just true, Scylla
also returns the old contents of the row.
2. Similarly, in an IF EXISTS write that failed (the row did not exist),
Cassandra returns just false, Scylla also returns extra null values for
each and every column of the row.
3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.
4. When there are static columns, Scylla's LWT response returns the static
column first, Cassandra returns the modified column first. Since both
also say which columns they return, neither is more correct than the other,
a normally users will address specific columns by name, not by position.
5. docs/kb/lwt-differences.rst explains that "the returned result set
contains an old row for every conditional statement in the batch".
Beyond this different, actually non-conditional updates in the batch will
also get a row in Scylla's result. Refs #27955.
6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
and other conditions for the same row. Cassandra doesn't, so checks that
these combinations are not allowed were commented out.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#27961
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.
A side effect of that is that metrics that are not supposed to be reported are.
Fixes#27903Closesscylladb/scylladb#27954
* github.com:scylladb/scylladb:
distributed_loader: system_replicated_keys as system keyspace
replicated_key_provider: make KSNAME public
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.
The lambda captures may be destroyed before the lambda is called, leading
to use after free.
Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.
Fixes: https://github.com/scylladb/scylladb/issues/28200.
Closesscylladb/scylladb#28201
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.
Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
Fixes: https://github.com/scylladb/scylladb/issues/28167
No backport needed; changes that introduced a bug are only on master
Closesscylladb/scylladb#28168
* github.com:scylladb/scylladb:
service: fin indentation
test: add test_numeric_rf_to_rack_list_conversion_abort
service: tasks: fix type of global_topology_request_virtual_task
service: do not change the schema while pausing the rf change
Refs #27429
Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.
Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.
Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.
Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.
When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...
Closesscylladb/scylladb#28133
* github.com:scylladb/scylladb:
test/cluster/test_internode_compression: Transpose test from dtest
gossiper/main: Extend special treatment of node ID resolve for rpc_address
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.
However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.
So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:
# Modifications: Copyright 2026-present ScyllaDB
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
with the appropriate year instead of 2026.
Fixes#28215
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28216
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.
And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.
It's much easier to ensure the leader, than to prepare tests to
handle failovers.
In case of decommission, it's not desirable because it's less urgent.
In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.
Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
In case of decommission, it's not desirable because it's less
urgent.
In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.
Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.
This trips tests which expect validation failure, but get tablet
draining failure instead.
Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.
So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.
validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.
Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.
After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.
---
This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.
Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.
---
Fixes: SCYLLADB-249
New feature, should land into 2026.1
Closesscylladb/scylladb#28109
* github.com:scylladb/scylladb:
docs: update documentation on filtering with vector queries
test/vector_search: add test for filtered ANN with VS mock
test/vector_search: add restriction to JSON conversion unit tests
vector_search: cql: construct and use filter in ANN vector queries
select_statement: do not require post query ordering for vector queries
vector_search: add `statement_restrictions` to JSON parsing
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```
Closesscylladb/scylladb#28147
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.
The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
- update_user_scheduling_group_v1 and update_user_scheduling_group_v2
use the deducing this (this auto self) to ensure the proper
lifecycle of the lambda capture.
- update_control_connection_scheduling_group doesn’t use the deducing
this, but the lambda captures only `this`, which is used before
the first possible coroutine preemption. Therefore, it doesn’t seem
that any memory corruption or undefined behavior is possible here.
Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.
Fixes: SCYLLADB-284
Closesscylladb/scylladb#28158
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:
```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
Fixes#28195.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28197
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).
We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.
Refs: #11469Fixes: #28148Closesscylladb/scylladb#28149
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.
Closesscylladb/scylladb#28196
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).
Retrieve the type from the variant before passing it to task type.
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.
Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes https://github.com/scylladb/scylladb/issues/27921
No backport becuse only master is vulnerable (size-based load balancing).
Closesscylladb/scylladb#27926
* https://github.com/scylladb/scylladb:
test: cluster: Add reproducer for missed notification in topology coordinator
topology_coordinator: Wake up the state machine after stats refresh
topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
topology_coordinator: Fix potential missed notification
topology_coordinator: Refresh load stats after table is created or altered
tablets: Do a group0 read barrier on tablet load stats refresh
topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
test: Use ManagerClient.{disable,enable}_tablet_balancing()
test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.
Adjust existing `vector_store_client_test` tests accordingly.
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.
If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.
This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.
Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.
Fixes#28115Closesscylladb/scylladb#28121
Refs #27429
re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.
v2:
* Included positive test (mode=all)
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).
Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.
Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).
Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.
Fixes#26914.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.
As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.
Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).
Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".
This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.
In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.
In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.
Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).
As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.
Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.
Mark them as xfail-ing. The marker will be removed later in this series.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Refs #27429
If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.
Added export of this via gossiper for symmetry.
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.
The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.
Fixes#25163
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.
This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.
Refs #28086
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.
Fix by subscribing for the event before checking for work.
Fixes#27958
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.
Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).
This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.
Fixes#27921
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
It's a global operation, so we can use any server.
It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.
It will be used to identify it as a system keyspace.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:39:51 +02:00
429 changed files with 14048 additions and 4212 deletions
co_returnapi_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
// Creating an index in tablets mode requires the keyspace to be RF-rack-valid.
// GSI and LSI indexes are based on materialized views which require RF-rack-validity to avoid consistency issues.
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
,ignore_dead_nodes_for_replace(this,"ignore_dead_nodes_for_replace",value_status::Used,"","List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")
,override_decommission(this,"override_decommission",value_status::Deprecated,false,"Set true to force a decommissioned node to join the cluster (cannot be set if consistent-cluster-management is enabled).")
,enable_repair_based_node_ops(this,"enable_repair_based_node_ops",liveness::LiveUpdate,value_status::Used,true,"Set true to use enable repair based node operations instead of streaming based.")
,allowed_repair_based_node_ops(this,"allowed_repair_based_node_ops",liveness::LiveUpdate,value_status::Used,"replace,removenode,rebuild,bootstrap,decommission","A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
,allowed_repair_based_node_ops(this,"allowed_repair_based_node_ops",liveness::LiveUpdate,value_status::Used,"replace,removenode,rebuild","A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
,enable_compacting_data_for_streaming_and_repair(this,"enable_compacting_data_for_streaming_and_repair",liveness::LiveUpdate,value_status::Used,true,"Enable the compacting reader, which compacts the data for streaming and repair (load'n'stream included) before sending it to, or synchronizing it with peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead.")
"If the compacting reader is enabled for streaming and repair (see enable_compacting_data_for_streaming_and_repair), allow it to garbage-collect tombstones."
,enable_create_table_with_compact_storage(this,"enable_create_table_with_compact_storage",liveness::LiveUpdate,value_status::Used,false,"Enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE. This feature will eventually be removed in a future version.")
@@ -190,7 +190,7 @@ then every rack in every datacenter receives a replica, except for racks compris
of only :doc:`zero-token nodes </architecture/zero-token-nodes>`. Racks added after
the keyspace creation do not receive replicas.
When ``rf_rack_valid_keyspaces``` is enabled in the config and the keyspace is tablet-based,
When ``enforce_rack_list`` (or (deprecated) ``rf_rack_valid_keyspaces``) is enabled in the config and the keyspace is tablet-based,
the numeric replication factor is automatically expanded into a rack list when the statement is
executed, which can be observed in the DESCRIBE output afterwards. If the numeric RF is smaller than
the number of racks in a DC, a subset of racks is chosen arbitrarily.
@@ -200,8 +200,6 @@ for two cases. One is setting replication factor to 0, in which case the number
The other is when the numeric replication factor is equal to the current number of replicas
for a given datacanter, in which case the current rack list is preserved.
Altering from a numeric replication factor to a rack list is not supported yet.
Note that when ``ALTER`` ing keyspaces and supplying ``replication_factor``,
auto-expansion will only *add* new datacenters for safety, it will not alter
existing datacenters or remove any even if they are no longer in the cluster.
@@ -424,6 +422,21 @@ Altering from a rack list to a numeric replication factor is not supported.
Keyspaces which use rack lists are :term:`RF-rack-valid <RF-rack-valid keyspace>` if each rack in the rack list contains at least one node (excluding :doc:`zero-token nodes </architecture/zero-token-nodes>`).
.._conversion-to-rack-list-rf:
Conversion to rack-list replication factor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To migrate a keyspace from a numeric replication factor to a rack-list replication factor, provide the rack-list replication factor explicitly in ALTER KEYSPACE statement. The number of racks in the list must be equal to the numeric replication factor. The replication factor can be converted in any number of DCs at once. In a statement that converts replication factor, no replication factor updates (increase or decrease) are allowed in any DC.
The `manifest` member contains the following attributes:
- `version` - respresenting the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.
The `node` member contains metadata about this node that enables datacenter- or rack-aware restore.
- `host_id` - is the node's unique host_id (a UUID).
- `datacenter` - is the node's datacenter.
- `rack` - is the node's rack.
The `snapshot` member contains metadata about the snapshot.
- `name` - is the snapshot name (a.k.a. `tag`)
- `created_at` - is the time when the snapshot was created.
- `expires_at` - is an optional time when the snapshot expires and can be dropped, if a TTL was set for the snapshot. If there is no TTL, `expires_at` may be omitted, set to null, or set to 0.
The `table` member contains metadata about the table being snapshot.
- `keyspace_name` and `table_name` - are self-explanatory.
- `table_id` - a universally unique identifier (UUID) of the table set when the table is created.
- `tablets_type`:
- `none` - if the keyspace uses vnodes replication
- `powof2` - if the keyspace uses tables replication, and the tablet token ranges are based on powers of 2.
- `tablet_count` - Optional. If `tablets_type` is not `none`, contains the number of tablets allcated in the table. If `tablets_type` is `powof2`, tablet_count would be a power of 2.
The `sstables` member is a list containing metadata about the SSTables in the snapshot.
- `id` - is the STable's unique id (a UUID). It is carried over with the SSTable when it's streamed as part of tablet migration, even if it gets a new generation.
- `toc_name` - is the name of the SSTable Table Of Contents (TOC) component.
- `data_size` and `index_size` - are the sizes of the SSTable's data and index components, respectively. They can be used to estimate how much disk space is needed for restore.
- `first_token` and `last_token` - are the first and last tokens in the SSTable, respectively. They can be used to determine if a SSTable is fully contained in a (tablet) token range to enable efficient file-based streaming of the SSTable.
The optional `files` member may contain a list of non-SSTable files included in the snapshot directory, not including the manifest.json file and schema.cql.
```
3.`CREATE KEYSPACE` with S3/GS storage
When creating a keyspace with S3/GS storage, the data is stored under the bucket passed as argument to the `CREATE KEYSPACE` statement.
*`storage_allocated_load` - Disk space allocated for tablets, assuming each tablet has a fixed size (target_tablet_size).
*`storage_allocated_utilization` - Fraction of node's disk capacity taken for `storage_allocated_load`, where 1.0 means full utilization.
*`storage_capacity` - Total disk capacity in bytes. Used to compute `storage_allocated_utilization`. By default equal to file system's capacity.
*`effective_capacity` - Sum of available disk space and tablet sizes on a node. Used to compute load on a node for size based balancing.
*`storage_load` - Disk space allocated for tablets, computed with actual tablet sizes. Can be null if some of the tablet sizes are not known.
*`storage_utilization` - Fraction of node's disk capacity taken for `storage_load` (with actual tablet sizes), where 1.0 means full utilization. Can be null if some of the tablet sizes are not known.
*`tablets_allocated` - Number of tablet replicas on the node. Migrating tablets are accounted as if migration already finished.
Traditionally, launching :doc:`repairs </operating-scylla/procedures/maintenance/repair>` in a ScyllaDB cluster is left to an external process, typically done via `Scylla Manager <https://manager.docs.scylladb.com/stable/repair/index.html>`_.
Automatic repair offers built-in scheduling in ScyllaDB itself. If the time since the last repair is greater than the configured repair interval, ScyllaDB will start a repair for the :doc:`tablet table </architecture/tablets>` automatically.
Repairs are spread over time and among nodes and shards, to avoid load spikes or any adverse effects on user workloads.
To enable automatic repair, add this to the configuration (``scylla.yaml``):
..code-block::yaml
auto_repair_enabled_default:true
auto_repair_threshold_default_in_seconds:86400
This will enable automatic repair for all tables with a repair period of 1 day. This configuration has to be set on each node, to an identical value.
More featureful configuration methods will be implemented in the future.
To disable, set ``auto_repair_enabled_default: false``.
Automatic repair relies on :doc:`Incremental Repair </features/incremental-repair>` and as such it only works with :doc:`tablet </architecture/tablets>` tables.
ScyllaDB's standard repair process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
ScyllaDB's standard :doc:`repair </operating-scylla/procedures/maintenance/repair>` process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
The core idea of incremental repair is to repair only the data that has been written or changed since the last repair was run. It intelligently skips data that has already been verified, dramatically reducing the time, I/O, and CPU resources required for the repair operation.
@@ -28,8 +28,7 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
The available modes are:
@@ -38,7 +37,12 @@ The available modes are:
*``disabled``: Completely disables the incremental repair logic for the current operation. The repair behaves like a classic, non-incremental repair, and it does not read or update any incremental repair status markers.
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental. It can also be specified with the REST API, e.g., curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental.
It can also be specified with the REST API, e.g.:
..code::
curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
Benefits of Incremental Repair
------------------------------
@@ -47,6 +51,8 @@ Benefits of Incremental Repair
***Reduced Resource Usage:** Consumes significantly less CPU, I/O, and network bandwidth compared to a full repair.
***More Frequent Repairs:** The efficiency of incremental repair allows you to run it more frequently, ensuring a higher level of data consistency across your cluster at all times.
Tables using Incremental Repair can schedule repairs in ScyllaDB itself, with :doc:`Automatic Repair </features/automatic-repair>`.
Ensure your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support </getting-started/os-support>` for information about supported Linux distributions and versions.
Note that if you're on CentOS 7, only root offline installation is supported.
Download and Install
-----------------------
#. Download the latest tar.gz file for ScyllaDB version (x86 or ARM) from ``https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-<version>/``.
Example for version 6.1: https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-6.1/
**Example** for version 2025.1:
- Go to https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-2025.1/
- Download the ``scylla-unified`` file for the patch version you want to
install. For example, to install 2025.1.9 (x86), download
*:doc:`Recreate RAID devices </kb/raid-device>` - How to recreate your RAID devices without running scylla-setup
*:doc:`Configure ScyllaDB Networking with Multiple NIC/IP Combinations </kb/yaml-address>` - examples for setting the different IP addresses in scylla.yaml
*:doc:`Updating the Mode in perftune.yaml After a ScyllaDB Upgrade </kb/perftune-modes-sync>`
We improved ScyllaDB's performance by `removing the rx_queues_count from the mode
condition <https://github.com/scylladb/seastar/pull/949>`_. As a result, ScyllaDB operates in
the ``sq_split`` mode instead of the ``mq`` mode (see :doc:`Seastar Perftune </operating-scylla/admin-tools/perftune>` for information about the modes).
If you upgrade from an earlier version of ScyllaDB, your cluster's existing nodes may use the ``mq`` mode,
while new nodes will use the ``sq_split`` mode. As using different modes across one cluster is not recommended,
you should change the configuration to ensure that the ``sq_split`` mode is used on all nodes.
This section describes how to update the `perftune.yaml` file to configure the ``sq_split`` mode on all nodes.
Procedure
------------
The examples below assume that you are using the default locations for storing data and the `scylla.yaml` file,
A new ``/etc/scylla.d/cpuset.conf`` will be generated on the output.
#. Compare the contents of the newly generated ``/etc/scylla.d/cpuset.conf`` with ``/etc/scylla.d/cpuset.conf.old`` you created in step 1.
- If they are exactly the same, rename ``/etc/scylla.d/perftune.yaml.old`` you created in step 1 back to ``/etc/scylla.d/perftune.yaml`` and continue to the next node.
- If they are different, move on to the next steps.
#. Restart the ``scylla-server`` service.
.. code-block:: console
nodetool drain
sudo systemctl restart scylla-server
#. Wait for the service to become up and running (similarly to how it is done during a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart>`). It may take a considerable amount of time before the node is in the UN state due to resharding.
-``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
-``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
For example:
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.