Compare commits

..

183 Commits

Author SHA1 Message Date
Pavel Emelyanov
233da83dd9 Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close

(cherry picked from commit f27dc12b7c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29125
2026-03-25 10:04:39 +01:00
Robert Bindar
9b81939a93 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978

(cherry picked from commit 29619e48d7)

Closes scylladb/scylladb#29101
2026-03-24 16:37:28 +01:00
Michał Chojnowski
804842e95c test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152

(cherry picked from commit f29525f3a6)

Closes scylladb/scylladb#29172
2026-03-24 16:02:48 +02:00
Botond Dénes
4f77cb621f Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()

(cherry picked from commit 5573c3b18e)

Closes scylladb/scylladb#29144
2026-03-21 01:37:30 +01:00
Raphael S. Carvalho
eb6c333e1b streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).

We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-790.
Closes scylladb/scylladb#28505

(cherry picked from commit 5b550e94a6)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28769
2026-03-20 15:50:46 +02:00
Calle Wilund
8d21636a81 test_internode_compression: Add await for "run" coro:s
Fixes: SCYLLADB-907

Closes scylladb/scylladb#28885

(cherry picked from commit 35aab75256)

Closes scylladb/scylladb#28905
2026-03-20 10:32:31 +02:00
Yaron Kaikov
7f236baf61 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935

(cherry picked from commit 977bdd6260)

Closes scylladb/scylladb#28946
2026-03-20 10:32:06 +02:00
Calle Wilund
4da8641d83 test_encryption: Fix test_system_auth_encryption
Fixes: SCYLLADB-915

Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).

Closes scylladb/scylladb#28887

(cherry picked from commit ef795eda5b)

Closes scylladb/scylladb#28966
2026-03-20 10:30:48 +02:00
Łukasz Paszkowski
3ab789e1ca test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626

(cherry picked from commit 826fd5d6c3)

Closes scylladb/scylladb#28967
2026-03-20 10:30:23 +02:00
Asias He
25a17282bd test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness.  This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.

Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.

Fixes SCYLLADB-865

Closes scylladb/scylladb#28945

(cherry picked from commit e0483f6001)

Closes scylladb/scylladb#28969
2026-03-20 10:30:00 +02:00
Botond Dénes
7afcc56128 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855

(cherry picked from commit 13ff9c4394)

Closes scylladb/scylladb#28975
2026-03-20 10:29:23 +02:00
Andrzej Jackowski
32443ed6f7 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Fixes: SCYLLADB-1014
Refs: SCYLLADB-163

Closes scylladb/scylladb#28982

(cherry picked from commit 9247dff8c2)

Closes scylladb/scylladb#28988
2026-03-20 10:28:45 +02:00
Botond Dénes
3e9b984020 Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait

(cherry picked from commit 3fed6f9eff)

Closes scylladb/scylladb#28991
2026-03-20 10:28:03 +02:00
Dmitriy Kruglov
2d199fb609 docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458

(cherry picked from commit cee44716db)

Closes scylladb/scylladb#28995
2026-03-20 10:27:29 +02:00
Piotr Dulikowski
35cd7f9239 Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race

(cherry picked from commit d9a277453e)

Closes scylladb/scylladb#29001
2026-03-20 10:27:04 +02:00
Pavel Emelyanov
32ce43d4b1 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050

(cherry picked from commit 8b1ca6dcd6)

Closes scylladb/scylladb#29107
2026-03-20 10:25:32 +02:00
Botond Dénes
fef7750eb6 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children

(cherry picked from commit 2e47fd9f56)

Closes scylladb/scylladb#29126
2026-03-20 10:24:30 +02:00
Botond Dénes
213442227d Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions

(cherry picked from commit e8b37d1a89)

Closes scylladb/scylladb#29135
2026-03-20 10:23:55 +02:00
Patryk Jędrzejczak
1398a55d16 test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.

Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.

This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103

Backport to all supported branches (other than 2026.1), as the test can fail
there.

Closes scylladb/scylladb#29108
2026-03-20 10:22:40 +02:00
Anna Stuchlik
a0a2a67634 doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103

(cherry picked from commit 6b1df5202c)

Closes scylladb/scylladb#29137
2026-03-20 10:22:08 +02:00
Avi Kivity
d4e454b5bc Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amoritze partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config

(cherry picked from commit 5e7fb08bf3)

Closes scylladb/scylladb#29136
2026-03-20 01:21:49 +01:00
Anna Stuchlik
825a36c97a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111

(cherry picked from commit 88b98fac3a)

Closes scylladb/scylladb#29119
2026-03-19 16:39:44 +02:00
Tomasz Grabiec
45413e99a5 Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync

(cherry picked from commit 518470e89e)

Closes scylladb/scylladb#29034
2026-03-19 14:03:40 +01:00
Michał Hudobski
c93a935564 docs: update vector search filtering to reflect primary key support only
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.

Closes scylladb/scylladb#28949

(cherry picked from commit 40d180a7ef)

Closes scylladb/scylladb#29069
2026-03-19 11:19:53 +01:00
Botond Dénes
69f78ce74a Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload

(cherry picked from commit 172c786079)

Closes scylladb/scylladb#29098
2026-03-18 13:13:01 +01:00
Patryk Jędrzejczak
3513ce6069 test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998

(cherry picked from commit 526e5986fe)

Closes scylladb/scylladb#29068
2026-03-17 18:04:27 +01:00
Piotr Dulikowski
0ca7253315 Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.

Fixes: VECTOR-528

Backports to 2025.04 and 2026.01 are required, as these branches are also affected.

Closes scylladb/scylladb#28637

* github.com:scylladb/scylladb:
  vector_search: fix TLS server name with IP
  vector_search: add warn log for failed ann requests

(cherry picked from commit 23ed0d4df8)

Closes scylladb/scylladb#28964
2026-03-17 15:30:07 +01:00
Piotr Dulikowski
c7ac3b5394 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928

(cherry picked from commit 42d70baad3)

Closes scylladb/scylladb#28968
2026-03-17 13:35:22 +01:00
Piotr Dulikowski
d6ed05efc1 Merge '[Backport 2026.1] mv: allow skipping view updates when a collection is unmodified' from Scylladb[bot]
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: SCYLLADB-996

- (cherry picked from commit 01ddc17ab9)

Parent PR: #28839

Closes scylladb/scylladb#28977

* github.com:scylladb/scylladb:
  Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
  mv: remove dead code in view_updates::can_skip_view_updates
2026-03-17 13:34:28 +01:00
Jenkins Promoter
39fcc83e75 Update ScyllaDB version to: 2026.1.1 2026-03-17 13:21:54 +02:00
Jenkins Promoter
6250f1e967 Update pgo profiles - aarch64 2026-03-15 05:07:46 +02:00
Aleksandra Martyniuk
b307c9301d nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739

(cherry picked from commit 2e68f48068)

Closes scylladb/scylladb#29006
2026-03-14 22:28:00 +02:00
Botond Dénes
f26af8cd30 mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010

(cherry picked from commit 15cfa5beeb)

Closes scylladb/scylladb#29024
2026-03-12 23:34:58 +02:00
Marcin Maliszkiewicz
2bd10bff5e Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819

The flaky test is present on master and in previous release, so backporting only there.

Closes scylladb/scylladb#28849

* github.com:scylladb/scylladb:
  test_proxy_protocol: introduce extra logging to aid debugging
  test_proxy_protocol: fix flaky system.clients visibility checks

(cherry picked from commit 4150c62f29)

Closes scylladb/scylladb#28951
2026-03-10 22:48:14 +02:00
Avi Kivity
1105d83893 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset

(cherry picked from commit 01ddc17ab9)
2026-03-10 21:27:23 +01:00
Wojciech Mitros
9b9d5cee8a mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.

It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

(cherry picked from commit ca1c8ff209)
2026-03-10 21:24:53 +01:00
Tomasz Grabiec
a8fd9936a3 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param

(cherry picked from commit b90fe19a42)

Closes scylladb/scylladb#28916
2026-03-10 11:58:03 +01:00
Botond Dénes
9190d42863 Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop

(cherry picked from commit 509f2af8db)

Closes scylladb/scylladb#28934
2026-03-09 10:25:47 +02:00
Jenkins Promoter
09b0e1ba8b Update ScyllaDB version to: 2026.1.0 2026-03-08 11:47:02 +02:00
Avi Kivity
8a7f5f1428 Merge '[Backport 2026.1] raft ropology: prevent crashes of multiple nodes' from Scylladb[bot]
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

- (cherry picked from commit e21ecf69de)

- (cherry picked from commit 8e9c7397c5)

Parent PR: #28558

Manually cherry-picked 2a3476094e.

Closes scylladb/scylladb#28735

* github.com:scylladb/scylladb:
  storage_service: raft_topology_cmd_handler: fix use-after-free
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-03-05 20:49:03 +02:00
Szymon Malewski
185288f16e test: vector_similarity: Fix similarity value checks
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877

Closes scylladb/scylladb#28748

(cherry picked from commit 4c4673e8f9)

Closes scylladb/scylladb#28907
2026-03-05 20:48:36 +02:00
Anna Stuchlik
7dfcb53197 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152

(cherry picked from commit 855c503c63)

Closes scylladb/scylladb#28910
2026-03-05 18:07:00 +02:00
Marcin Maliszkiewicz
d552244812 Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)

Backport to 2026.1, as the test is also there.

[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28877

* github.com:scylladb/scylladb:
  test: increase timeouts in test_client_routes.py
  test: add missing awaits in test_client_routes_upgrade

(cherry picked from commit 9697b6013f)

Closes scylladb/scylladb#28896
2026-03-05 17:46:39 +02:00
Patryk Jędrzejczak
62b344cb55 storage_service: raft_topology_cmd_handler: fix use-after-free
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```

This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.

No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.

Closes scylladb/scylladb#28772

(cherry picked from commit 2a3476094e)
2026-03-05 13:41:44 +01:00
Botond Dénes
173bfd627c Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795

backport: 2026.1, already on other stable branches

Closes scylladb/scylladb#28848

* github.com:scylladb/scylladb:
  test: add more logs to test_startup_no_auth_response
  test: decrease strain in test_startup_response

(cherry picked from commit f156bcddab)

Closes scylladb/scylladb#28878
2026-03-05 09:54:51 +01:00
Piotr Dulikowski
bf369326d6 Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

Closes scylladb/scylladb#28846

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness

(cherry picked from commit 2fb981413a)

Closes scylladb/scylladb#28879
2026-03-04 18:09:04 +01:00
Anna Stuchlik
864774fb00 doc: fix the upgrage guide to 2026.1
The upgrade guide on branch-2026.1 has bugs caused by incorrectly
resolved conflicts in the backport PR: https://github.com/scylladb/scylladb/pull/28835#issuecomment-3992474167

This commit fixes the issue.
The fix only applies to branch-2026.1.

Fixes https://github.com/scylladb/scylladb/issues/28871

Closes scylladb/scylladb#28872
2026-03-04 16:53:49 +02:00
Botond Dénes
49ed97cec8 Merge '[Backport 2026.1] Fix regression in Alternator TTL with tablets and node going down' from Scylladb[bot]
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.

Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.

It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.

Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.

So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):

1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).

Fixes SCYLLADB-777.

- (cherry picked from commit 9ab3d5b946)

- (cherry picked from commit 0c7f499750)

- (cherry picked from commit e463d528fe)

Parent PR: #28771

Closes scylladb/scylladb#28803

* github.com:scylladb/scylladb:
  test: add unit test for tablet_map::get_secondary_replica()
  test, alternator: add test for TTL expiration with a node down
  locator: fix get_secondary_replica() to match get_primary_replica()
2026-03-04 14:21:44 +02:00
Marcin Maliszkiewicz
81685b0d06 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature

(cherry picked from commit a83ee6cf66)

Closes scylladb/scylladb#28853
2026-03-04 08:28:39 +02:00
Anna Stuchlik
06013b2377 doc: add the upgrade guide from 2025.x to 2026.1
This commit adds the upgrade guide for version 2026.1.

According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.

In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.

Fixes https://github.com/scylladb/scylladb/issues/28533

Fixes  https://github.com/scylladb/scylladb/issues/28532

Closes scylladb/scylladb#28789

(cherry picked from commit dfd46ad3fb)

Closes scylladb/scylladb#28835
2026-03-03 13:04:16 +02:00
Grzegorz Burzyński
4cc5c2605f packaging: add systemctl command to dependencies
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.

Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.

Fixes: scylladb/scylla-operator#3080

Closes scylladb/scylladb#28567

(cherry picked from commit b4f0eb666f)

Closes scylladb/scylladb#28845
2026-03-03 13:03:46 +02:00
Anna Stuchlik
021851c5c5 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254

(cherry picked from commit 64b1798513)

Closes scylladb/scylladb#28818
2026-03-03 10:39:46 +01:00
Patryk Jędrzejczak
c4aa14c1a7 test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```

Fixes SCYLLADB-805

Closes scylladb/scylladb#28829

(cherry picked from commit ba7f314cdc)

Closes scylladb/scylladb#28850
2026-03-03 10:21:11 +01:00
Jenkins Promoter
0fdb0961a2 Update ScyllaDB version to: 2026.1.0-rc4 2026-03-02 20:36:38 +02:00
Roy Dahan
2100ae2d0a install.sh: fix REST API paths for nonroot installations
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.

Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)

Fixes: SCYLLADB-721

Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.

Closes scylladb/scylladb#28805

(cherry picked from commit 822c1597c9)

Closes scylladb/scylladb#28836
2026-03-01 23:20:12 +02:00
Jenkins Promoter
51fc498314 Update pgo profiles - aarch64 2026-03-01 05:01:46 +02:00
Jenkins Promoter
f4b938df09 Update pgo profiles - x86_64 2026-02-28 21:23:33 -05:00
Botond Dénes
0dfefc3f12 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080

(cherry picked from commit b637e17b19)

Closes scylladb/scylladb#28725
2026-02-27 06:32:15 +02:00
Łukasz Paszkowski
883e3e014a compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
    co_await cstate.compaction_done.wait([..] {
        return num_runs_for_compaction() <= threshold
            || !can_perform_regular_compaction(t);
    });
```
to a while loop with a predicated wait:
```
    while (can_perform_regular_compaction(t)
           && co_await num_runs_for_compaction() > threshold) {
        co_await cstate.compaction_done.wait([this, &t] {
            return !can_perform_regular_compaction(t);
        });
    }
```

This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).

However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.

This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.

Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610

Closes scylladb/scylladb#28801

(cherry picked from commit bb57b0f3b7)
2026-02-27 01:38:13 +02:00
Yaron Kaikov
4ccb795beb .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR

Closes scylladb/scylladb#28804

(cherry picked from commit b211590bc0)

Closes scylladb/scylladb#28812
2026-02-26 09:29:19 +02:00
Yaron Kaikov
9e02b0f45f ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs
refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv

- Remove -v and -i flags from curl to prevent credentials from being
  logged in workflow output
- Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting
  to prevent shell injection via crafted PR metadata
- Add org membership verification step for pull_request_target events so
  that only PRs from scylladb org members can trigger Jenkins CI

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796

Closes scylladb/scylladb#28785

(cherry picked from commit 98494e08eb)

Closes scylladb/scylladb#28811
2026-02-26 09:28:00 +02:00
Anna Stuchlik
eb9b8dbf62 doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781

(cherry picked from commit e2333a57ad)

Closes scylladb/scylladb#28795
2026-02-26 09:26:57 +02:00
Andrzej Jackowski
995df5dec6 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746

(cherry picked from commit cd4caed3d3)

Closes scylladb/scylladb#28794
2026-02-26 09:26:23 +02:00
Calle Wilund
beb781b829 gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588

(cherry picked from commit 8e71a6f52a)

Closes scylladb/scylladb#28724
2026-02-26 09:24:43 +02:00
Marcin Maliszkiewicz
502b7f296d Merge '[Backport 2026.1] vector_search: return NaN for similarity_cosine with all-zero vectors' from Scylladb[bot]
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

- (cherry picked from commit af0889d194)

- (cherry picked from commit 4e32502bb3)

Parent PR: #28609

Closes scylladb/scylladb#28775

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-25 14:34:58 +01:00
Nadav Har'El
b251ee02a4 test: add unit test for tablet_map::get_secondary_replica()
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.

This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit e463d528fe)
2026-02-25 12:59:26 +00:00
Nadav Har'El
f26d08dde2 test, alternator: add test for TTL expiration with a node down
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:

 1. Even though Alternator TTL splits the work of scanning and expiring
    items between nodes, all the items get correctly expired.
 2. When one node is down, all the items still expire because the
    "secondary" owner of each token range takes over expiring the
   items in this range while the "primary" owner is down.

This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0c7f499750)
2026-02-25 12:59:26 +00:00
Nadav Har'El
9cd1038c7a locator: fix get_secondary_replica() to match get_primary_replica()
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.

The next two patches will have two tests that fail before this patch,
and pass with it:

1. A unit test that checks that get_secondary_replica() returns a
   different node than get_primary_replica().

2. An Alternator TTL test that checks that when a node is down,
   expirations still happen because the secondary replica takes over
   the primary replica's work.

Fixes SCYLLADB-777

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9ab3d5b946)
2026-02-25 12:59:25 +00:00
Avi Kivity
fdae3e4f3a Merge '[Backport 2026.1] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot]
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

- (cherry picked from commit 289e910cec)

- (cherry picked from commit 6280cb91ca)

- (cherry picked from commit 960adbb439)

Parent PR: #28592

Closes scylladb/scylladb#28697

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-24 14:23:33 +02:00
Avi Kivity
d47e4898ea Merge '[Backport 2026.1] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot]
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

- (cherry picked from commit eefe66b2b2)

- (cherry picked from commit e08ac60161)

- (cherry picked from commit 1c764cf6ea)

- (cherry picked from commit e4c42acd8f)

- (cherry picked from commit 9ccc95808f)

Parent PR: #28521

Closes scylladb/scylladb#28780

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-24 14:21:50 +02:00
Aleksandra Martyniuk
7bc87de838 docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
(cherry picked from commit 9ccc95808f)
2026-02-24 09:26:27 +01:00
Aleksandra Martyniuk
2141b9b824 docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
(cherry picked from commit e4c42acd8f)
2026-02-24 09:26:13 +01:00
Aleksandra Martyniuk
aa50edbf17 docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
(cherry picked from commit 1c764cf6ea)
2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk
7f836aa3ec docs: describe upgrade to enforce_rack_list option
(cherry picked from commit e08ac60161)
2026-02-23 20:58:23 +00:00
Aleksandra Martyniuk
bd26803c1a docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
(cherry picked from commit eefe66b2b2)
2026-02-23 20:58:23 +00:00
Dawid Pawlik
2feed49285 test/vector_search: add reproducer for rescoring with zero vectors
Add reproducer for the SCYLLADB-456 issue following exception
on ANN vector queries with rescoring with similarity cosine.

(cherry picked from commit 4e32502bb3)
2026-02-23 17:09:51 +00:00
Dawid Pawlik
3007cb6f37 vector_search: return NaN for similarity_cosine with all-zero vectors
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456
(cherry picked from commit af0889d194)
2026-02-23 17:09:50 +00:00
Ferenc Szili
1e2d1c7e85 load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703

(cherry picked from commit f1bc17bd4c)

Closes scylladb/scylladb#28729
2026-02-23 15:02:48 +01:00
Jenkins Promoter
55ad575c8f Update ScyllaDB version to: 2026.1.0-rc3 2026-02-22 14:48:35 +02:00
Tomasz Grabiec
8982140cd9 Merge '[Backport 2026.1] test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Scylladb[bot]
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

- (cherry picked from commit e14eca46af)

- (cherry picked from commit 2454de4f8f)

- (cherry picked from commit d33d38139f)

Parent PR: #28688

Closes scylladb/scylladb#28750

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-21 01:45:52 +01:00
Tomasz Grabiec
e90449f770 test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

(cherry picked from commit d33d38139f)
2026-02-20 16:35:39 +00:00
Tomasz Grabiec
98fd5c5e45 test: cluster: task_manager_client: Introduce wait_task_appears()
(cherry picked from commit 2454de4f8f)
2026-02-20 16:35:39 +00:00
Tomasz Grabiec
cca6a1c3dd tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.

(cherry picked from commit e14eca46af)
2026-02-20 16:35:39 +00:00
Patryk Jędrzejczak
9edd0ae3fb raft topology: prevent accessing nullptr returned by topology::find
It's better to raise an internal error than cause a segmentation fault on
possibly multiple nodes.

(cherry picked from commit 8e9c7397c5)
2026-02-20 13:00:21 +01:00
Patryk Jędrzejczak
dc3133b031 raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

(cherry picked from commit e21ecf69de)
2026-02-20 13:00:19 +01:00
Szymon Malewski
86554e6192 vector: Improve similarity functions performance
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471

Closes scylladb/scylladb#28615

(cherry picked from commit 668d6fe019)

Closes scylladb/scylladb#28690
2026-02-19 14:14:39 +02:00
Asias He
637618560b repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640

(cherry picked from commit 1be80c9e86)

Closes scylladb/scylladb#28714
2026-02-19 13:07:37 +02:00
Avi Kivity
8c3c5777da Merge '[Backport 2026.1] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot]
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

- (cherry picked from commit 0376d16ad3)

- (cherry picked from commit 3b98451776)

Parent PR: #28530

Closes scylladb/scylladb#28716

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-19 12:47:38 +02:00
Tomasz Grabiec
bb9a5261ec Merge '[Backport 2026.1] test: fix flaky test_balance_empty_tablets' from Scylladb[bot]
The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node.

After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets.

Fixes: SCYLLADB-603

The test is present in master and 2026.1, so we need to backport this.

- (cherry picked from commit 4ca40929ef)

Parent PR: #28598

Closes scylladb/scylladb#28638

* github.com:scylladb/scylladb:
  test/cluster: Remove short_tablet_stats_refresh_interval injection
  test: add read barrier to test_balance_empty_tablets
2026-02-18 23:39:03 +01:00
Marcin Maliszkiewicz
d5d81cc066 test: auth_cluster: add test for hanged AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s

(cherry picked from commit 3b98451776)
2026-02-18 19:43:02 +00:00
Marcin Maliszkiewicz
6a438543c2 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485

(cherry picked from commit 0376d16ad3)
2026-02-18 19:43:02 +00:00
Botond Dénes
99a67484bf Merge '[Backport 2026.1] cql3/statements/describe_statement: hide paxos state tables ' from Scylladb[bot]
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

- (cherry picked from commit f89a8c4ec4)

- (cherry picked from commit 9baaddb613)

Parent PR: #28230

Closes scylladb/scylladb#28508

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-18 12:41:08 +02:00
Anna Stuchlik
cabf2845d9 doc: fix the links on the repair-related pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28199.

This commit fixes the syntax of the internal links.

Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28487

(cherry picked from commit 77480c9d8f)

Closes scylladb/scylladb#28512
2026-02-18 12:39:07 +02:00
Yaron Kaikov
ff4a0fc87e ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543

(cherry picked from commit b30ecb72d5)

Closes scylladb/scylladb#28553
2026-02-18 12:38:10 +02:00
Botond Dénes
0a89dbb4d4 Merge '[Backport 2026.1] raft topology: generate notification about released nodes only once' from Scylladb[bot]
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

- (cherry picked from commit 10e9672852)

- (cherry picked from commit d28c841fa9)

- (cherry picked from commit 29da20744a)

Parent PR: #28367

Closes scylladb/scylladb#28612

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-18 12:37:33 +02:00
Aleksandra Martyniuk
19cbaa1be2 test: fix test_remove_node_violating_rf_rack_with_rack_list
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.

Add the 5th node to the cluster. This way the majority is always up.

Fixes: https://github.com/scylladb/scylladb/issues/28596.

Closes scylladb/scylladb#28610

(cherry picked from commit f955a90309)

Closes scylladb/scylladb#28639
2026-02-18 12:36:52 +02:00
Anna Mikhlin
9cf0f0998d .github/workflows: ignore quoted comments for trigger CI
prevent CI from being triggered when trigger-ci command appears inside
quoted (>) comment text

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604

(cherry picked from commit 33cf97d688)

Closes scylladb/scylladb#28652
2026-02-18 12:36:28 +02:00
Ernest Zaslavsky
f56e1760d7 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.

Closes scylladb/scylladb#28554

(cherry picked from commit 034c6fbd87)

Closes scylladb/scylladb#28668
2026-02-18 12:35:23 +02:00
Ernest Zaslavsky
db4e3a664d api: report restore params
report restore params once the API's call for restore is invoked

Closes scylladb/scylladb#28431

(cherry picked from commit 30699ed84b)

Closes scylladb/scylladb#28683
2026-02-18 12:34:33 +02:00
Botond Dénes
c292892d5f Merge '[Backport 2026.1] GCS object storage. Fix incompatibilty issues with "real" GCS' from Scylladb[bot]
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
 Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

- (cherry picked from commit a896d8d5e3)

- (cherry picked from commit 87aa6c8387)

Parent PR: #28400

Closes scylladb/scylladb#28685

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-18 12:33:28 +02:00
Calle Wilund
d87467f77b commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679

(cherry picked from commit ab4e4a8ac7)

Closes scylladb/scylladb#28693
2026-02-18 12:32:43 +02:00
Ernest Zaslavsky
6e92ee1bb2 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html

(cherry picked from commit 960adbb439)
2026-02-18 09:41:28 +00:00
Ernest Zaslavsky
4ecc402b79 s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.

(cherry picked from commit 6280cb91ca)
2026-02-18 09:41:27 +00:00
Ernest Zaslavsky
b27adefc16 s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.

(cherry picked from commit 289e910cec)
2026-02-18 09:41:27 +00:00
Calle Wilund
cad92d5100 utils/gcp/object_storage: URL-encode object names in URL:s
Fixes #28398

When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.

Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.

(cherry picked from commit 87aa6c8387)
2026-02-17 19:32:19 +00:00
Calle Wilund
44cc5ae30b utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.

Need to handle.

(cherry picked from commit a896d8d5e3)
2026-02-17 19:32:18 +00:00
Piotr Dulikowski
05a5bd542a Merge '[Backport 2026.1] vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Scylladb[bot]
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

- (cherry picked from commit aef5ff7491)

- (cherry picked from commit 079fe17e8b)

Parent PR: #28617

Closes scylladb/scylladb#28643

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-17 10:44:13 +01:00
Patryk Jędrzejczak
8a626bb458 Merge '[Backport 2026.1] test: explicitly set compression algorithm in test_autoretrain_dict' from Scylladb[bot]
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

- (cherry picked from commit e63cfc38b3)

- (cherry picked from commit 9ffa62a986)

Parent PR: #28625

Closes scylladb/scylladb#28667

* https://github.com/scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-17 10:19:26 +01:00
Patryk Jędrzejczak
3a56a0cf99 test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632

(cherry picked from commit 0693091aff)

Closes scylladb/scylladb#28673
2026-02-17 10:03:50 +01:00
Nikos Dragazis
0cdac69aab test/cluster: Remove short_tablet_stats_refresh_interval injection
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.

This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).

Use the config option. This reduces the execution time to ~8 seconds.

Fixes SCYLLADB-556.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28536

(cherry picked from commit 5d1e6243af)
2026-02-17 08:35:16 +01:00
Ferenc Szili
f04a3acf33 test: add read barrier to test_balance_empty_tablets
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.

After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.

Fixes: SCYLLADB-603

Closes scylladb/scylladb#28598

(cherry picked from commit 4ca40929ef)
2026-02-17 08:31:50 +01:00
Andrzej Jackowski
ad716f9341 test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
(cherry picked from commit 9ffa62a986)
2026-02-16 16:23:36 +00:00
Andrzej Jackowski
2edd87f2e1 test: remove unneeded semicolons from python test
(cherry picked from commit e63cfc38b3)
2026-02-16 16:23:36 +00:00
Piotr Dulikowski
ba10e74523 Merge '[Backport 2026.1] test: cluster: Fix test_sync_point' from Scylladb[bot]
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

- (cherry picked from commit 628e74f157)

- (cherry picked from commit ac4af5f461)

- (cherry picked from commit c5239edf2a)

- (cherry picked from commit a256ba7de0)

- (cherry picked from commit f83f911bae)

Parent PR: #28602

Closes scylladb/scylladb#28623

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-16 10:32:03 +01:00
Jenkins Promoter
5abc2fea9f Update pgo profiles - aarch64 2026-02-15 05:01:51 +02:00
Karol Nowacki
2ab81f768b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.

(cherry picked from commit 079fe17e8b)
2026-02-13 21:24:45 +00:00
Karol Nowacki
cfebb52db0 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
(cherry picked from commit aef5ff7491)
2026-02-13 21:24:45 +00:00
Dawid Mędrek
37ef37e8ab test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.

(cherry picked from commit f83f911bae)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
fdad814aa3 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

(cherry picked from commit a256ba7de0)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
0257f7cc89 test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.

(cherry picked from commit c5239edf2a)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
07bfd920e7 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.

(cherry picked from commit ac4af5f461)
2026-02-12 12:13:19 +00:00
Dawid Mędrek
698ba5bd0b test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.

(cherry picked from commit 628e74f157)
2026-02-12 12:13:19 +00:00
Piotr Dulikowski
d8c7303d14 storage_service: fix indentation after previous patch
(cherry picked from commit 29da20744a)
2026-02-11 12:07:15 +00:00
Piotr Dulikowski
9365adb2fb raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.

Fixes: #28301
Refs: #25031
(cherry picked from commit d28c841fa9)
2026-02-11 12:07:14 +00:00
Piotr Dulikowski
6a55396e90 raft topology: extract "released" nodes calculation to external function
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.

(cherry picked from commit 10e9672852)
2026-02-11 12:07:14 +00:00
Pawel Pery
f4b79c1b1d Revert "Merge 'vector_search: add validator tests' from Pawel Pery"
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.

There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.

It needs a backport to 2026.1 to remove remaining validator tests from there.

Fixes: VECTOR-497

Closes scylladb/scylladb#28568

(cherry picked from commit 81d11a23ce)

Closes scylladb/scylladb#28577
2026-02-09 15:16:40 +02:00
Michał Hudobski
f633f57163 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519

(cherry picked from commit 6b9fcc6ca3)

Closes scylladb/scylladb#28538
2026-02-05 10:31:39 +01:00
Jenkins Promoter
09ed4178a6 Update ScyllaDB version to: 2026.1.0-rc2 2026-02-03 18:15:05 +02:00
Patryk Jędrzejczak
2bf7a0f65e Merge '[Backport 2026.1] storage_service: set up topology properly in maintenance mode' from Scylladb[bot]
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

- (cherry picked from commit a08c53ae4b)

- (cherry picked from commit 9d4a5ade08)

- (cherry picked from commit c92962ca45)

- (cherry picked from commit 408c6ea3ee)

- (cherry picked from commit 53f58b85b7)

- (cherry picked from commit 867a1ca346)

- (cherry picked from commit 6c547e1692)

- (cherry picked from commit 7e7b9977c5)

Parent PR: #28322

Closes scylladb/scylladb#28499

* https://github.com/scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-03 10:39:41 +01:00
Nadav Har'El
5b15c52f1e test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9baaddb613)
2026-02-02 23:31:58 +00:00
Michał Jadwiszczak
26e17202f6 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes scylladb/scylladb#28183

(cherry picked from commit f89a8c4ec4)
2026-02-02 23:31:58 +00:00
Patryk Jędrzejczak
b62e1b405b test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.

(cherry picked from commit 7e7b9977c5)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
f3d2a16e66 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.

(cherry picked from commit 6c547e1692)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
eee99ebb3d test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.

(cherry picked from commit 867a1ca346)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
c248744c5a test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.

(cherry picked from commit 53f58b85b7)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
4ba3c08d45 test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.

(cherry picked from commit 408c6ea3ee)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
c8c21cc29c test: test_maintenance_mode: remove the redundant value from the query result
(cherry picked from commit c92962ca45)
2026-02-02 17:02:16 +00:00
Patryk Jędrzejczak
e95689c96b storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.

(cherry picked from commit 9d4a5ade08)
2026-02-02 17:02:15 +00:00
Patryk Jędrzejczak
6094f4b7b2 storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988

(cherry picked from commit a08c53ae4b)
2026-02-02 17:02:15 +00:00
Avi Kivity
ad64dc7c01 Merge '[Backport 2026.1] load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Scylladb[bot]
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

- (cherry picked from commit 71be10b8d6)

- (cherry picked from commit 92dbde54a5)

Parent PR: #28440

Closes scylladb/scylladb#28471

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-02-01 13:51:31 +02:00
Jenkins Promoter
bafd185087 Update pgo profiles - aarch64 2026-02-01 05:06:48 +02:00
Jenkins Promoter
07d1f8f48a Update pgo profiles - x86_64 2026-02-01 04:20:45 +02:00
Ferenc Szili
523d529d27 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.

(cherry picked from commit 92dbde54a5)
2026-02-01 00:34:26 +00:00
Ferenc Szili
c8dbd43ed5 load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.

(cherry picked from commit 71be10b8d6)
2026-02-01 00:34:26 +00:00
Botond Dénes
0cf9f41649 Merge '[Backport 2026.1] docs: add documentation for automatic repair' from Scylladb[bot]
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.

Fixes: SCYLLADB-130

This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.

- (cherry picked from commit a84b1b8b78)

- (cherry picked from commit 57b2cd2c16)

- (cherry picked from commit 1713d75c0d)

Parent PR: #28199

Closes scylladb/scylladb#28424

* github.com:scylladb/scylladb:
  docs: add feature page for automatic repair
  docs: inter-link incremental-repair and repair documents
  docs: incremental-repair: fix curl example
2026-01-30 16:01:03 +02:00
Botond Dénes
dc89e2ea37 Merge '[Backport 2026.1] test: test_alternator_proxy_protocol: fix race between node startup and test start' from Scylladb[bot]
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.

Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.

Fixes #28210
Fixes #28211

Flaky tests are only in master and branch-2026.1, so backporting there.

- (cherry picked from commit ebac810c4e)

- (cherry picked from commit 59f2a3ce72)

Parent PR: #28291

Closes scylladb/scylladb#28443

* github.com:scylladb/scylladb:
  test: test_alternator_proxy_protocol: wait for the node to report itself as serving
  test: cluster_manager: add ability to wait for supervisor STATUS=serving
2026-01-30 15:59:09 +02:00
Tomasz Grabiec
797f56cb45 Merge '[Backport 2026.1] Improve load balancer logging and other minor cleanups' from Scylladb[bot]
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

- (cherry picked from commit 32b336e062)

- (cherry picked from commit df32318f66)

- (cherry picked from commit f2b0146f0f)

- (cherry picked from commit 0d090aa47b)

- (cherry picked from commit 12fdd205d6)

- (cherry picked from commit 615b86e88b)

- (cherry picked from commit 7228bd1502)

- (cherry picked from commit 4a161bff2d)

- (cherry picked from commit ef0e9ad34a)

- (cherry picked from commit 9715965d0c)

- (cherry picked from commit 8e831a7b6d)

Parent PR: #28337

Closes scylladb/scylladb#28428

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-30 14:08:34 +01:00
Pawel Pery
be1d418bc0 vector_search: allow full secondary indexes syntax while creating the vector index
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.

This patch allows creating vector indexes using this CQL syntax:

```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  discussion_board_id int,
  country text,
  lang text,
  PRIMARY KEY ((commenter, discussion_board_id), created_at)
);

CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
  ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
  ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Currently, if we run these queries to create indexes we will receive such errors:

```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```

This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.

Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).

This commits adds cqlpy test to check errors while creating indexes.

Fixes: SCYLLADB-298

This needs to be backported to version 2026.1 as this is a fix for filtering support.

Closes scylladb/scylladb#28366

(cherry picked from commit f49c9e896a)

Closes scylladb/scylladb#28448
2026-01-30 11:25:01 +01:00
Patryk Jędrzejczak
46923f7358 Merge '[Backport 2026.1] Introduce TTL and retries to address resolution' from Scylladb[bot]
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

- (cherry picked from commit bd9d5ad75b)

- (cherry picked from commit 359d0b7a3e)

- (cherry picked from commit ce0c7b5896)

- (cherry picked from commit 5b3e513cba)

- (cherry picked from commit 66a33619da)

- (cherry picked from commit 6eb7dba352)

- (cherry picked from commit a05a4593a6)

- (cherry picked from commit 3a31380b2c)

- (cherry picked from commit 912c48a806)

Parent PR: #27891

Closes scylladb/scylladb#28405

* https://github.com/scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-30 11:10:48 +01:00
Avi Kivity
4032e95715 test: test_alternator_proxy_protocol: wait for the node to report itself as serving
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.

(cherry picked from commit 59f2a3ce72)
2026-01-29 22:46:11 +02:00
Avi Kivity
eab10c00b1 test: cluster_manager: add ability to wait for supervisor STATUS=serving
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.

(cherry picked from commit ebac810c4e)
2026-01-29 19:48:53 +00:00
Patryk Jędrzejczak
091c3b4e22 test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388

(cherry picked from commit a2c1569e04)

Closes scylladb/scylladb#28417
2026-01-29 11:25:10 +01:00
Tomasz Grabiec
19eadafdef tablets: tablet_allocator.cc: Convert tabs to spaces
(cherry picked from commit 8e831a7b6d)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
358fc15893 tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.

(cherry picked from commit 9715965d0c)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
32124d209e tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0

(cherry picked from commit ef0e9ad34a)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
c7f4bda459 tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.

(cherry picked from commit 4a161bff2d)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
568af3cd8d tablets: load_balancer: Extract print_node_stats()
(cherry picked from commit 7228bd1502)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
bd694dd1a1 tablet: load_balancer: Use empty() instead of size() where applicable
(cherry picked from commit 615b86e88b)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
9672e0171f tablets: Fix redundancy in migration_plan::empty()
(cherry picked from commit 12fdd205d6)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
8cec41acf2 tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.

(cherry picked from commit 0d090aa47b)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
d207de0d76 tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.

(cherry picked from commit f2b0146f0f)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
edde4e878e tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.

(cherry picked from commit df32318f66)
2026-01-29 09:06:49 +00:00
Tomasz Grabiec
be1c674f1a tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.

(cherry picked from commit 32b336e062)
2026-01-29 09:06:49 +00:00
Botond Dénes
a7cff37024 docs: add feature page for automatic repair
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.

(cherry picked from commit 1713d75c0d)
2026-01-29 00:25:21 +00:00
Botond Dénes
9431bc5628 docs: inter-link incremental-repair and repair documents
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.

(cherry picked from commit 57b2cd2c16)
2026-01-29 00:25:21 +00:00
Botond Dénes
14db8375ac docs: incremental-repair: fix curl example
Currently it is regular text, make it a code block so it is easier to
read and copy+paste.

(cherry picked from commit a84b1b8b78)
2026-01-29 00:25:21 +00:00
Ernest Zaslavsky
614020b5d5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240

(cherry picked from commit cb2aa85cf5)

Closes scylladb/scylladb#28345
2026-01-28 14:58:28 +02:00
Anna Stuchlik
e091afb400 doc: add the version name to the Install pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28022
It adds the version name to more install pages.

Closes scylladb/scylladb#28289

(cherry picked from commit c25b770342)

Closes scylladb/scylladb#28362
2026-01-28 12:52:23 +02:00
Ernest Zaslavsky
edc46fe6a1 connection_factory: includes cleanup
(cherry picked from commit 912c48a806)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
f8b9b767c2 dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.

(cherry picked from commit 3a31380b2c)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
23d038b385 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.

(cherry picked from commit a05a4593a6)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
3e2d1384bf connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.

(cherry picked from commit 6eb7dba352)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
bd7481e30c connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`

(cherry picked from commit 66a33619da)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
16d7b65754 connection_factory: extract connection logic into a member
extract connection logic into a private member function to make it reusable

(cherry picked from commit 5b3e513cba)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
e30c01eae6 connection_factory: remove unnecessary else
(cherry picked from commit ce0c7b5896)
2026-01-27 22:43:08 +00:00
Ernest Zaslavsky
d0f3725887 connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.

(cherry picked from commit 359d0b7a3e)
2026-01-27 22:43:07 +00:00
Ernest Zaslavsky
c12168b7ef s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call

(cherry picked from commit bd9d5ad75b)
2026-01-27 22:43:07 +00:00
Yaron Kaikov
76c0162060 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374

(cherry picked from commit 3f10f44232)

Closes scylladb/scylladb#28394
2026-01-27 16:05:06 +02:00
Avi Kivity
c9620d9573 test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336

(cherry picked from commit ec70cea2a1)

Closes scylladb/scylladb#28365
2026-01-27 12:20:18 +02:00
Anna Stuchlik
91cf77d016 doc: update the GPG keys
Update the keys in the installation instructions (linux packages).

Fixes https://github.com/scylladb/scylladb/issues/28330

Closes scylladb/scylladb#28357

(cherry picked from commit edc291961b)

Closes scylladb/scylladb#28370
2026-01-27 11:20:11 +02:00
Anna Stuchlik
2c2f0693ab doc: remove the troubleshooting section on upgrades from OSS
This commit removes a document originally created to troubleshoot
upgrades from Open Source to Enterprise.

Since we no longer support Open Source, this document is now redundant.

Fixes https://github.com/scylladb/scylladb/issues/28246

Closes scylladb/scylladb#28248

(cherry picked from commit 84281f900f)

Closes scylladb/scylladb#28360
2026-01-26 19:17:28 +02:00
Jenkins Promoter
2c73d0e6b5 Update ScyllaDB version to: 2026.1.0-rc1 2026-01-26 11:06:45 +02:00
Yaron Kaikov
f94296e0ae Update ScyllaDB version to: 2026.1.0-rc0 2026-01-25 11:07:17 +02:00
3865 changed files with 12710 additions and 11132 deletions

View File

@@ -1,22 +0,0 @@
name: Sync Jira Based on PR Milestone Events
on:
pull_request_target:
types: [milestoned, demilestoned]
permissions:
contents: read
pull-requests: read
jobs:
jira-sync-milestone-set:
if: github.event.action == 'milestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_set.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-milestone-removed:
if: github.event.action == 'demilestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_removed.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,4 +1,4 @@
name: Call Jira release creation for new milestone
name: Call Jira release creation for new milestone
on:
milestone:
@@ -9,6 +9,6 @@ jobs:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER,SMI"
jira_project_keys: "SCYLLADB,CUSTOMER"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,62 +0,0 @@
name: Close issues created by Scylla associates
on:
issues:
types: [opened, reopened]
permissions:
issues: write
jobs:
comment-and-close:
runs-on: ubuntu-latest
steps:
- name: Comment and close if author email is scylladb.com
uses: actions/github-script@v7
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const issue = context.payload.issue;
const actor = context.actor;
// Get user data (only public email is available)
const { data: user } = await github.rest.users.getByUsername({
username: actor,
});
const email = user.email || "";
console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
// Only continue if email exists and ends with @scylladb.com
if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
console.log("User is not a scylladb.com email (or email not public); skipping.");
return;
}
const owner = context.repo.owner;
const repo = context.repo.repo;
const issue_number = issue.number;
const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
// Add the comment
await github.rest.issues.createComment({
owner,
repo,
issue_number,
body,
});
console.log(`Comment added to #${issue_number}`);
// Close the issue
await github.rest.issues.update({
owner,
repo,
issue_number,
state: "closed",
state_reason: "not_planned"
});
console.log(`Issue #${issue_number} closed.`);

View File

@@ -14,8 +14,7 @@ env:
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
permissions:
contents: read
permissions: {}
# cancel the in-progress run upon a repush
concurrency:
@@ -35,6 +34,8 @@ jobs:
- uses: actions/checkout@v4
with:
submodules: true
- run: |
sudo dnf -y install clang-tools-extra
- name: Generate compilation database
run: |
cmake \

View File

@@ -14,32 +14,37 @@ jobs:
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "${{ github.event_name }}" == "pull_request_target" ]]; then
AUTHOR="${{ github.event.pull_request.user.login }}"
ASSOCIATION="${{ github.event.pull_request.author_association }}"
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="${{ github.event.comment.user.login }}"
ASSOCIATION="${{ github.event.comment.author_association }}"
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then
ORG="scylladb"
if gh api "/orgs/${ORG}/members/${AUTHOR}" --silent 2>/dev/null; then
echo "member=true" >> $GITHUB_OUTPUT
else
echo "::warning::${AUTHOR} is not a member of scylladb (association: ${ASSOCIATION}); skipping CI trigger."
echo "::warning::${AUTHOR} is not a member of ${ORG}; skipping CI trigger."
echo "member=false" >> $GITHUB_OUTPUT
fi
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
BODY=$(cat << 'EOF'
${{ github.event.comment.body }}
EOF
)
CLEAN_BODY=$(echo "$BODY" | grep -v '^[[:space:]]*>')
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT

View File

@@ -300,6 +300,7 @@ add_subdirectory(locator)
add_subdirectory(message)
add_subdirectory(mutation)
add_subdirectory(mutation_writer)
add_subdirectory(node_ops)
add_subdirectory(readers)
add_subdirectory(replica)
add_subdirectory(raft)

View File

@@ -43,7 +43,7 @@ For further information, please see:
[developer documentation]: HACKING.md
[build documentation]: docs/dev/building.md
[docker image build documentation]: dist/docker/redhat/README.md
[docker image build documentation]: dist/docker/debian/README.md
## Running Scylla

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.2.0-dev
VERSION=2026.1.1
if test -f version
then

View File

@@ -618,7 +618,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function can throw a ValidationException API error if there
// This function can throw an ValidationException API error if there
// are errors in the format of the condition itself.
bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");

View File

@@ -45,7 +45,7 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
}
void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {
if (_should_add_to_response) {
if (_should_add_to_reponse) {
auto consumption = rjson::empty_object();
rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());
rjson::add(response, "ConsumedCapacity", std::move(consumption));

View File

@@ -28,9 +28,9 @@ namespace alternator {
class consumed_capacity_counter {
public:
consumed_capacity_counter() = default;
consumed_capacity_counter(bool should_add_to_response) : _should_add_to_response(should_add_to_response){}
consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}
bool operator()() const noexcept {
return _should_add_to_response;
return _should_add_to_reponse;
}
consumed_capacity_counter& operator +=(uint64_t bytes);
@@ -44,7 +44,7 @@ public:
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_response = false;
bool _should_add_to_reponse = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {

View File

@@ -63,7 +63,6 @@
#include "types/types.hh"
#include "db/system_keyspace.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "alternator/ttl_tag.hh"
using namespace std::chrono_literals;
@@ -165,7 +164,7 @@ static map_type attrs_type() {
static const column_definition& attrs_column(const schema& schema) {
const column_definition* cdef = schema.get_column_definition(bytes(executor::ATTRS_COLUMN_NAME));
throwing_assert(cdef);
SCYLLA_ASSERT(cdef);
return *cdef;
}
@@ -238,7 +237,7 @@ static void validate_is_object(const rjson::value& value, const char* caller) {
}
// This function assumes the given value is an object and returns requested member value.
// If it is not possible, an api_error::validation is thrown.
// If it is not possible an api_error::validation is thrown.
static const rjson::value& get_member(const rjson::value& obj, const char* member_name, const char* caller) {
validate_is_object(obj, caller);
const rjson::value* ret = rjson::find(obj, member_name);
@@ -250,7 +249,7 @@ static const rjson::value& get_member(const rjson::value& obj, const char* membe
// This function assumes the given value is an object with a single member, and returns this member.
// In case the requirements are not met, an api_error::validation is thrown.
// In case the requirements are not met an api_error::validation is thrown.
static const rjson::value::Member& get_single_member(const rjson::value& v, const char* caller) {
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error::validation(format("{}: expected an object with a single member.", caller));
@@ -683,7 +682,7 @@ static std::optional<int> get_int_attribute(const rjson::value& value, std::stri
}
// Sets a KeySchema object inside the given JSON parent describing the key
// attributes of the given schema as being either HASH or RANGE keys.
// attributes of the the given schema as being either HASH or RANGE keys.
// Additionally, adds to a given map mappings between the key attribute
// names and their type (as a DynamoDB type string).
void executor::describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>* attribute_types, const std::map<sstring, sstring> *tags) {
@@ -917,7 +916,7 @@ future<rjson::value> executor::fill_table_description(schema_ptr schema, table_s
sstring index_name = cf_name.substr(delim_it + 1);
rjson::add(view_entry, "IndexName", rjson::from_string(index_name));
rjson::add(view_entry, "IndexArn", generate_arn_for_index(*schema, index_name));
// Add index's KeySchema and collect types for AttributeDefinitions:
// Add indexes's KeySchema and collect types for AttributeDefinitions:
executor::describe_key_schema(view_entry, *vptr, key_attribute_types, db::get_tags_of_table(vptr));
// Add projection type
rjson::value projection = rjson::empty_object();
@@ -1650,7 +1649,7 @@ static future<> mark_view_schemas_as_built(utils::chunked_vector<mutation>& out,
}
future<executor::request_return_type> executor::create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization, const db::tablets_mode_t::mode tablets_mode) {
throwing_assert(this_shard_id() == 0);
SCYLLA_ASSERT(this_shard_id() == 0);
// We begin by parsing and validating the content of the CreateTable
// command. We can't inspect the current database schema at this point
@@ -2436,7 +2435,7 @@ std::unordered_map<bytes, std::string> si_key_attributes(data_dictionary::table
// case, this function simply won't be called for this attribute.)
//
// This function checks if the given attribute update is an update to some
// GSI's key, and if the value is unsuitable, an api_error::validation is
// GSI's key, and if the value is unsuitable, a api_error::validation is
// thrown. The checking here is similar to the checking done in
// get_key_from_typed_value() for the base table's key columns.
//
@@ -2838,12 +2837,14 @@ future<executor::request_return_type> rmw_operation::execute(service::storage_pr
}
} else if (_write_isolation != write_isolation::LWT_ALWAYS) {
std::optional<mutation> m = apply(nullptr, api::new_timestamp(), cdc_opts);
throwing_assert(m); // !needs_read_before_write, so apply() did not check a condition
SCYLLA_ASSERT(m); // !needs_read_before_write, so apply() did not check a condition
return proxy.mutate(utils::chunked_vector<mutation>{std::move(*m)}, db::consistency_level::LOCAL_QUORUM, executor::default_timeout(), trace_state, std::move(permit), db::allow_per_partition_rate_limit::yes, false, std::move(cdc_opts)).then([this, &wcu_total] () mutable {
return rmw_operation_return(std::move(_return_attributes), _consumed_capacity, wcu_total);
});
}
throwing_assert(cas_shard);
if (!cas_shard) {
on_internal_error(elogger, "cas_shard is not set");
}
// If we're still here, we need to do this write using LWT:
global_stats.write_using_lwt++;
per_table_stats.write_using_lwt++;
@@ -3547,7 +3548,7 @@ static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>
return true;
}
// Add a path to an attribute_path_map. Throws a validation error if the path
// Add a path to a attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the filter (one is a sub-path of the other)
// or "conflicts" with it (both a member and index is requested).
template<typename T>
@@ -5412,7 +5413,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
}
static dht::token token_for_segment(int segment, int total_segments) {
throwing_assert(total_segments > 1 && segment >= 0 && segment < total_segments);
SCYLLA_ASSERT(total_segments > 1 && segment >= 0 && segment < total_segments);
uint64_t delta = std::numeric_limits<uint64_t>::max() / total_segments;
return dht::token::from_int64(std::numeric_limits<int64_t>::min() + delta * segment);
}

View File

@@ -50,7 +50,7 @@ public:
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string name) {
void add_dot(std::string(name)) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
@@ -85,7 +85,7 @@ struct constant {
}
};
// "value" is a value used in the right hand side of an assignment
// "value" is is a value used in the right hand side of an assignment
// expression, "SET a = ...". It can be a constant (a reference to a value
// included in the request, e.g., ":val"), a path to an attribute from the
// existing item (e.g., "a.b[3].c"), or a function of other such values.
@@ -205,7 +205,7 @@ public:
// The supported primitive conditions are:
// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and
// v1 and v2 are values - from the item (an attribute path), the query
// (a ":val" reference), or a function of the above (only the size()
// (a ":val" reference), or a function of the the above (only the size()
// function is supported).
// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).
// 3. N-ary operator - v1 IN ( v2, v3, ... )

View File

@@ -55,7 +55,7 @@ partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it. Otherwise,
// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

View File

@@ -710,7 +710,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
throwing_assert(req->content_stream);
SCYLLA_ASSERT(req->content_stream);
chunked_content content = co_await read_entire_stream(*req->content_stream, request_content_length_limit);
// If the request had no Content-Length, we reserved too many units
// so need to return some

View File

@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// TODO: label
@@ -502,121 +502,123 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
std::map<db_clock::time_point, cdc::streams_version> topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
std::optional<shard_id> last;
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
std::optional<shard_id> last;
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// #7409 - shards must be returned in lexicographical order,
std::sort(lo, end, id_cmp);
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
}
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
std::sort(lo, end, id_cmp);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
last = std::nullopt;
}
}
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
}
enum class shard_iterator_type {
@@ -896,169 +898,172 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
service::storage_proxy::coordinator_query_result qr = co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state));
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
auto records = rjson::empty_array();
auto result_set = builder.build();
auto records = rjson::empty_array();
auto& metadata = result_set->get_metadata();
auto& metadata = result_set->get_metadata();
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
using op_utype = std::underlying_type_t<cdc::operation>;
using op_utype = std::underlying_type_t<cdc::operation>;
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
{
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
{
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
}
}
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
co_return rjson::print(std::move(ret));
}
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
db_clock::time_point ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });
auto& shard = iter.shard;
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
co_return make_streamed(std::move(ret));
}
co_return rjson::print(std::move(ret));
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

View File

@@ -46,7 +46,6 @@
#include "alternator/executor.hh"
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "alternator/ttl_tag.hh"
#include "dht/sharder.hh"
#include "db/config.hh"
#include "db/tags/utils.hh"
@@ -58,10 +57,19 @@ static logging::logger tlogger("alternator_ttl");
namespace alternator {
// We write the expiration-time attribute enabled on a table in a
// tag TTL_TAG_KEY.
// Currently, the *value* of this tag is simply the name of the attribute,
// and the expiration scanner interprets it as an Alternator attribute name -
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column. Although this is designed for Alternator, it may
// be good enough for CQL as well (there, the ":attrs" column won't exist).
extern const sstring TTL_TAG_KEY;
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
}
schema_ptr schema = get_table(_proxy, request);
@@ -133,7 +141,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via an UpdateTimeToLive request.
// Alternator tables with TTL configured via a UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -316,7 +324,9 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
std::vector<std::pair<dht::token_range, locator::host_id>> ret;
throwing_assert(!sorted_tokens.empty());
if (sorted_tokens.empty()) {
on_internal_error(tlogger, "Token metadata is empty");
}
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
@@ -553,7 +563,7 @@ static future<> scan_table_ranges(
expiration_service::stats& expiration_stats)
{
const schema_ptr& s = scan_ctx.s;
throwing_assert(partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
SCYLLA_ASSERT (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,
*scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);
while (!p->is_exhausted()) {
@@ -583,7 +593,7 @@ static future<> scan_table_ranges(
if (retries >= 10) {
// Don't get stuck forever asking the same page, maybe there's
// a bug or a real problem in several replicas. Give up on
// this scan and retry the scan from a random position later,
// this scan an retry the scan from a random position later,
// in the next scan period.
throw runtime_exception("scanner thread failed after too many timeouts for the same page");
}
@@ -630,38 +640,13 @@ static future<> scan_table_ranges(
}
} else {
// For a real column to contain an expiration time, it
// must be a numeric type. We currently support decimal
// (used by Alternator TTL) as well as bigint, int and
// timestamp (used by CQL per-row TTL).
switch (meta[*expiration_column]->type->get_kind()) {
case abstract_type::kind::decimal:
// Used by Alternator TTL for key columns not stored
// in the map. The value is in seconds, fractional
// part is ignored.
expired = is_expired(value_cast<big_decimal>(v), now);
break;
case abstract_type::kind::long_kind:
// Used by CQL per-row TTL. The value is in seconds.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int64_t>(v))), now);
break;
case abstract_type::kind::int32:
// Used by CQL per-row TTL. The value is in seconds.
// Using int type is not recommended because it will
// overflow in 2038, but we support it to allow users
// to use existing int columns for expiration.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int32_t>(v))), now);
break;
case abstract_type::kind::timestamp:
// Used by CQL per-row TTL. The value is in milliseconds
// but we truncate it to gc_clock's precision (whole seconds).
expired = is_expired(gc_clock::time_point(std::chrono::duration_cast<gc_clock::duration>(value_cast<db_clock::time_point>(v).time_since_epoch())), now);
break;
default:
// Should never happen - we verified the column's type
// before starting the scan.
[[unlikely]]
on_internal_error(tlogger, format("expiration scanner value of unsupported type {} in column {}", meta[*expiration_column]->type->cql3_type_name(), scan_ctx.column_name) );
}
// must be a numeric type.
// FIXME: Currently we only support decimal_type (which is
// what Alternator uses), but other numeric types can be
// supported as well to make this feature more useful in CQL.
// Note that kind::decimal is also checked above.
big_decimal n = value_cast<big_decimal>(v);
expired = is_expired(n, now);
}
if (expired) {
expiration_stats.items_deleted++;
@@ -723,12 +708,16 @@ static future<bool> scan_table(
co_return false;
}
// attribute_name may be one of the schema's columns (in Alternator, this
// means a key column, in CQL it's a regular column), or an element in
// Alternator's attrs map encoded in Alternator's JSON encoding (which we
// decode). If attribute_name is a real column, in Alternator it will have
// the type decimal, counting seconds since the UNIX epoch, while in CQL
// it will one of the types bigint or int (counting seconds) or timestamp
// (counting milliseconds).
// means it's a key column), or an element in Alternator's attrs map
// encoded in Alternator's JSON encoding.
// FIXME: To make this less Alternators-specific, we should encode in the
// single key's value three things:
// 1. The name of a column
// 2. Optionally if column is a map, a member in the map
// 3. The deserializer for the value: CQL or Alternator (JSON).
// The deserializer can be guessed: If the given column or map item is
// numeric, it can be used directly. If it is a "bytes" type, it needs to
// be deserialized using Alternator's deserializer.
bytes column_name = to_bytes(*attribute_name);
const column_definition *cd = s->get_column_definition(column_name);
std::optional<std::string> member;
@@ -747,14 +736,11 @@ static future<bool> scan_table(
data_type column_type = cd->type;
// Verify that the column has the right type: If "member" exists
// the column must be a map, and if it doesn't, the column must
// be decimal_type (Alternator), bigint, int or timestamp (CQL).
// If the column has the wrong type nothing can get expired in
// this table, and it's pointless to scan it.
// (currently) be a decimal_type. If the column has the wrong type
// nothing can get expired in this table, and it's pointless to
// scan it.
if ((member && column_type->get_kind() != abstract_type::kind::map) ||
(!member && column_type->get_kind() != abstract_type::kind::decimal &&
column_type->get_kind() != abstract_type::kind::long_kind &&
column_type->get_kind() != abstract_type::kind::int32 &&
column_type->get_kind() != abstract_type::kind::timestamp)) {
(!member && column_type->get_kind() != abstract_type::kind::decimal)) {
tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());
co_return false;
}
@@ -892,10 +878,12 @@ future<> expiration_service::run() {
future<> expiration_service::start() {
// Called by main() on each shard to start the expiration-service
// thread. Just runs run() in the background and allows stop().
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
if (_db.features().alternator_ttl) {
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
}
return make_ready_future<>();
}

View File

@@ -30,7 +30,7 @@ namespace alternator {
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via an UpdateTimeToLive request.
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
public:
// Object holding per-shard statistics related to the expiration service.
@@ -52,7 +52,7 @@ private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the background service
// _end is set by start(), and resolves when the the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
std::optional<future<>> _end;

View File

@@ -1,26 +0,0 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "seastarx.hh"
#include <seastar/core/sstring.hh>
namespace alternator {
// We use the table tag TTL_TAG_KEY ("system:ttl_attribute") to remember
// which attribute was chosen as the expiration-time attribute for
// Alternator's TTL and CQL's per-row TTL features.
// Currently, the *value* of this tag is simply the name of the attribute:
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column (which Alternator uses).
extern const sstring TTL_TAG_KEY;
} // namespace alternator
// let users use TTL_TAG_KEY without the "alternator::" prefix,
// to make it easier to move it to a different namespace later.
using alternator::TTL_TAG_KEY;

View File

@@ -12,7 +12,7 @@
"operations":[
{
"method":"POST",
"summary":"Resets authorized prepared statements cache",
"summary":"Reset cache",
"type":"void",
"nickname":"authorization_cache_reset",
"produces":[

View File

@@ -3085,48 +3085,6 @@
}
]
},
{
"path":"/storage_service/tablets/snapshots",
"operations":[
{
"method":"POST",
"summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",
"type":"void",
"nickname":"take_cluster_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[

View File

@@ -23,6 +23,31 @@
namespace api {
template<class T>
std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {
std::vector<T> res;
res.reserve(map.size());
for (const auto& [key, value] : map) {
res.push_back(T());
res.back().key = key;
res.back().value = value;
}
return res;
}
template<class T, class MAP>
std::vector<T>& map_to_key_value(const MAP& map, std::vector<T>& res) {
res.reserve(res.size() + std::size(map));
for (const auto& [key, value] : map) {
T val;
val.key = fmt::to_string(key);
val.value = fmt::to_string(value);
res.push_back(val);
}
return res;
}
template <typename T, typename S = T>
T map_sum(T&& dest, const S& src) {
for (const auto& i : src) {

View File

@@ -536,15 +536,13 @@ void unset_sstables_loader(http_context& ctx, routes& r) {
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto view = req->get_path_param("view");
co_return json::json_return_type(stream_range_as_array(co_await vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()), [] (const auto& i) {
storage_service_json::mapper res;
res.key = i.first;
res.value = i.second;
return res;
}));
return vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
});
cf::get_built_indexes.set(r, [&vb](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -582,16 +580,6 @@ static future<json::json_return_type> describe_ring_as_json_for_table(const shar
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));
}
namespace {
template <typename Key, typename Value>
storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}
}
static
future<json::json_return_type>
rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -609,7 +597,12 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
throw bad_param_exception("Either provide both keyspace and table (for tablet table) or neither (for vnodes)");
}
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
co_return json::json_return_type(stream_range_as_array(token_endpoints, [](const auto& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}));
}
static
@@ -693,6 +686,7 @@ rest_get_range_to_endpoint_map(http_context& ctx, sharded<service::storage_servi
table_id = validate_table(ctx.db.local(), keyspace, table);
}
std::vector<ss::maplist_mapper> res;
co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace, table_id),
[](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){
ss::maplist_mapper m;
@@ -783,13 +777,17 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
apilog.info("cleanup_all global={}", global);
if (global) {
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
co_return co_await ss.do_clusterwide_vnodes_cleanup();
});
auto done = !global ? false : co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
if (!ss.is_topology_coordinator_enabled()) {
co_return false;
}
co_await ss.do_clusterwide_vnodes_cleanup();
co_return true;
});
if (done) {
co_return json::json_return_type(0);
}
// fall back to the local cleanup if local cleanup is requested
// fall back to the local cleanup if topology coordinator is not enabled or local cleanup is requested
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::global_cleanup_compaction_task_impl>({}, db);
@@ -797,7 +795,9 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
// Mark this node as clean
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
co_await ss.reset_cleanup_needed();
if (ss.is_topology_coordinator_enabled()) {
co_await ss.reset_cleanup_needed();
}
});
co_return json::json_return_type(0);
@@ -808,6 +808,9 @@ future<json::json_return_type>
rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("reset_cleanup_needed");
co_await ss.invoke_on(0, [] (service::storage_service& ss) {
if (!ss.is_topology_coordinator_enabled()) {
throw std::runtime_error("mark_node_as_clean is only supported when topology over raft is enabled");
}
return ss.reset_cleanup_needed();
});
co_return json_void();
@@ -1314,7 +1317,10 @@ rest_get_ownership(http_context& ctx, sharded<service::storage_service>& ss, std
throw httpd::bad_param_exception("storage_service/ownership cannot be used when a keyspace uses tablets");
}
co_return json::json_return_type(stream_range_as_array(co_await ss.local().get_ownership(), &map_to_json<gms::inet_address, float>));
return ss.local().get_ownership().then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
}
static
@@ -1331,7 +1337,10 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service
}
}
co_return json::json_return_type(stream_range_as_array(co_await ss.local().effective_ownership(keyspace_name, table_name), &map_to_json<gms::inet_address, float>));
return ss.local().effective_ownership(keyspace_name, table_name).then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
}
static
@@ -1341,7 +1350,7 @@ rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_ser
apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");
throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);
@@ -1407,7 +1416,7 @@ rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, serv
apilog.warn("retrain_dict: called before the cluster feature was enabled");
throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);
@@ -1565,7 +1574,16 @@ rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::
static
future<json::json_return_type>
rest_upgrade_to_raft_topology(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("Requested to schedule upgrade to raft topology, but this version does not need it since it uses raft topology by default.");
apilog.info("Requested to schedule upgrade to raft topology");
try {
co_await ss.invoke_on(0, [] (auto& ss) {
return ss.start_upgrade_to_raft_topology();
});
} catch (...) {
auto ex = std::current_exception();
apilog.error("Failed to schedule upgrade to raft topology: {}", ex);
std::rethrow_exception(std::move(ex));
}
co_return json_void();
}
@@ -2007,8 +2025,6 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto tcopt = req->get_query_param("tc");
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
@@ -2033,27 +2049,6 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
}
});
ss::take_cluster_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
apilog.info("take_cluster_snapshot: {}", req->get_query_params());
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("table"), ",");
// Note: not published/active. Retain as internal option, but...
auto sfopt = req->get_query_param("skip_flush");
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
std::vector<sstring> keynames = split(req->get_query_param("keyspace"), ",");
try {
co_await snap_ctl.local().take_cluster_column_family_snapshot(keynames, column_families, tag, opts);
co_return json_void();
} catch (...) {
apilog.error("take_cluster_snapshot failed: {}", std::current_exception());
throw;
}
});
ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
apilog.info("del_snapshot: {}", req->get_query_params());
auto tag = req->get_query_param("tag");

View File

@@ -209,11 +209,15 @@ future<> audit::stop_audit() {
});
}
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch) {
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table) {
if (!audit_instance().local_is_initialized()) {
return nullptr;
}
return std::make_unique<audit_info>(cat, keyspace, table, batch);
return std::make_unique<audit_info>(cat, keyspace, table);
}
audit_info_ptr audit::create_no_audit_info() {
return audit_info_ptr();
}
future<> audit::start(const db::config& cfg) {
@@ -263,21 +267,18 @@ future<> audit::log_login(const sstring& username, socket_address client_ip, boo
}
future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error) {
auto audit_info = statement->get_audit_info();
if (!audit_info) {
return make_ready_future<>();
}
if (audit_info->batch()) {
cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());
cql3::statements::batch_statement* batch = dynamic_cast<cql3::statements::batch_statement*>(statement.get());
if (batch != nullptr) {
return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {
return inspect(m.statement, query_state, options, error);
});
} else {
if (audit::local_audit_instance().should_log(audit_info)) {
auto audit_info = statement->get_audit_info();
if (bool(audit_info) && audit::local_audit_instance().should_log(audit_info)) {
return audit::local_audit_instance().log(audit_info, query_state, options, error);
}
return make_ready_future<>();
}
return make_ready_future<>();
}
future<> inspect_login(const sstring& username, socket_address client_ip, bool error) {

View File

@@ -75,13 +75,11 @@ class audit_info final {
sstring _keyspace;
sstring _table;
sstring _query;
bool _batch;
public:
audit_info(statement_category cat, sstring keyspace, sstring table, bool batch)
audit_info(statement_category cat, sstring keyspace, sstring table)
: _category(cat)
, _keyspace(std::move(keyspace))
, _table(std::move(table))
, _batch(batch)
{ }
void set_query_string(const std::string_view& query_string) {
_query = sstring(query_string);
@@ -91,7 +89,6 @@ public:
const sstring& query() const { return _query; }
sstring category_string() const;
statement_category category() const { return _category; }
bool batch() const { return _batch; }
};
using audit_info_ptr = std::unique_ptr<audit_info>;
@@ -129,7 +126,8 @@ public:
}
static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> stop_audit();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table);
static audit_info_ptr create_no_audit_info();
audit(locator::shared_token_metadata& stm,
cql3::query_processor& qp,
service::migration_manager& mm,

View File

@@ -17,6 +17,7 @@ target_sources(scylla_auth
password_authenticator.cc
passwords.cc
permission.cc
permissions_cache.cc
resource.cc
role_or_anonymous.cc
roles-metadata.cc

View File

@@ -8,7 +8,6 @@
#include "auth/cache.hh"
#include "auth/common.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
@@ -19,8 +18,6 @@
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/core/format.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/do_with.hh>
namespace auth {
@@ -30,21 +27,7 @@ cache::cache(cql3::query_processor& qp, abort_source& as) noexcept
: _current_version(0)
, _qp(qp)
, _loading_sem(1)
, _as(as)
, _permission_loader(nullptr)
, _permission_loader_sem(8) {
namespace sm = seastar::metrics;
_metrics.add_group("auth_cache", {
sm::make_gauge("roles", [this] { return _roles.size(); },
sm::description("Number of roles currently cached")),
sm::make_gauge("permissions", [this] {
return _cached_permissions_count;
}, sm::description("Total number of permission sets currently cached across all roles"))
});
}
void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
, _as(as) {
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
@@ -55,83 +38,6 @@ lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) cons
return it->second;
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;
if (is_anonymous(role)) {
perms_cache = &_anonymous_permissions;
} else {
const auto& role_name = *role.name;
auto role_it = _roles.find(role_name);
if (role_it == _roles.end()) {
// Role might have been deleted but there are some connections
// left which reference it. They should no longer have access to anything.
return make_ready_future<permission_set>(permissions::NONE);
}
role_ptr = role_it->second;
perms_cache = &role_ptr->cached_permissions;
}
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
return make_ready_future<permission_set>(it->second);
}
// keep alive role_ptr as it holds perms_cache (except anonymous)
return do_with(std::move(role_ptr), [this, &role, &r, perms_cache] (auto& role_ptr) {
return load_permissions(role, r, perms_cache);
});
}
future<permission_set> cache::load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache) {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_permission_loader_sem, 1, _as);
// Check again, perhaps we were blocked and other call loaded
// the permissions already. This is a protection against misses storm.
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
co_return it->second;
}
auto perms = co_await _permission_loader(role, r);
add_permissions(*perms_cache, r, perms);
co_return perms;
}
future<> cache::prune(const resource& r) {
auto units = co_await get_units(_loading_sem, 1, _as);
_anonymous_permissions.erase(r);
for (auto& it : _roles) {
// Prunning can run concurrently with other functions but it
// can only cause cached_permissions extra reload via get_permissions.
remove_permissions(it.second->cached_permissions, r);
co_await coroutine::maybe_yield();
}
}
future<> cache::reload_all_permissions() noexcept {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_loading_sem, 1, _as);
auto copy_keys = [] (const std::unordered_map<resource, permission_set>& m) {
std::vector<resource> keys;
keys.reserve(m.size());
for (const auto& [res, _] : m) {
keys.push_back(res);
}
return keys;
};
const role_or_anonymous anon;
for (const auto& res : copy_keys(_anonymous_permissions)) {
_anonymous_permissions[res] = co_await _permission_loader(anon, res);
}
for (auto& [role, entry] : _roles) {
auto& perms_cache = entry->cached_permissions;
auto r = role_or_anonymous(role);
for (const auto& res : copy_keys(perms_cache)) {
perms_cache[res] = co_await _permission_loader(r, res);
}
}
logger.debug("Reloaded auth cache with {} entries", _roles.size());
}
future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& role) const {
auto rec = make_lw_shared<role_record>();
rec->version = _current_version;
@@ -199,7 +105,7 @@ future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& r
future<> cache::prune_all() noexcept {
for (auto it = _roles.begin(); it != _roles.end(); ) {
if (it->second->version != _current_version) {
remove_role(it++);
_roles.erase(it++);
co_await coroutine::maybe_yield();
} else {
++it;
@@ -223,7 +129,7 @@ future<> cache::load_all() {
const auto name = r.get_as<sstring>("role");
auto role = co_await fetch_role(name);
if (role) {
add_role(name, role);
_roles[name] = role;
}
co_return stop_iteration::no;
};
@@ -236,32 +142,11 @@ future<> cache::load_all() {
co_await distribute_role(name, role);
}
co_await container().invoke_on_others([this](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
c._current_version = _current_version;
co_await c.prune_all();
});
}
future<> cache::gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name) {
if (!role) {
// Role might have been removed or not yet added, either way
// their members will be handled by another top call to this function.
co_return;
}
for (const auto& member_name : role->members) {
bool is_new = roles.insert(member_name).second;
if (!is_new) {
continue;
}
lw_shared_ptr<cache::role_record> member_role;
auto r = _roles.find(member_name);
if (r != _roles.end()) {
member_role = r->second;
}
co_await gather_inheriting_roles(roles, member_role, member_name);
}
}
future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
@@ -269,41 +154,27 @@ future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
std::unordered_set<role_name_t> roles_to_clear_perms;
for (const auto& name : roles) {
logger.info("Loading role {}", name);
auto role = co_await fetch_role(name);
if (role) {
add_role(name, role);
co_await gather_inheriting_roles(roles_to_clear_perms, role, name);
_roles[name] = role;
} else {
if (auto it = _roles.find(name); it != _roles.end()) {
auto old_role = it->second;
remove_role(it);
co_await gather_inheriting_roles(roles_to_clear_perms, old_role, name);
}
_roles.erase(name);
}
co_await distribute_role(name, role);
}
co_await container().invoke_on_all([&roles_to_clear_perms] (cache& c) -> future<> {
for (const auto& name : roles_to_clear_perms) {
c.clear_role_permissions(name);
co_await coroutine::maybe_yield();
}
});
}
future<> cache::distribute_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
auto role_ptr = role.get();
co_await container().invoke_on_others([&name, role_ptr](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
co_await container().invoke_on_others([&name, role_ptr](cache& c) {
if (!role_ptr) {
c.remove_role(name);
co_return;
c._roles.erase(name);
return;
}
auto role_copy = make_lw_shared<role_record>(*role_ptr);
c.add_role(name, std::move(role_copy));
c._roles[name] = std::move(role_copy);
});
}
@@ -314,40 +185,4 @@ bool cache::includes_table(const table_id& id) noexcept {
|| id == db::system_keyspace::role_permissions()->id();
}
void cache::add_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
}
_cached_permissions_count += role->cached_permissions.size();
_roles[name] = std::move(role);
}
void cache::remove_role(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
remove_role(it);
}
}
void cache::remove_role(roles_map::iterator it) {
_cached_permissions_count -= it->second->cached_permissions.size();
_roles.erase(it);
}
void cache::clear_role_permissions(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
it->second->cached_permissions.clear();
}
}
void cache::add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms) {
if (cache.emplace(r, perms).second) {
++_cached_permissions_count;
}
}
void cache::remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r) {
_cached_permissions_count -= cache.erase(r);
}
} // namespace auth

View File

@@ -17,14 +17,11 @@
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "auth/permission.hh"
#include "auth/common.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
namespace cql3 { class query_processor; }
@@ -34,7 +31,6 @@ class cache : public peering_sharded_service<cache> {
public:
using role_name_t = sstring;
using version_tag_t = char;
using permission_loader_func = std::function<future<permission_set>(const role_or_anonymous&, const resource&)>;
struct role_record {
bool can_login = false;
@@ -44,19 +40,11 @@ public:
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance
std::unordered_map<resource, permission_set> cached_permissions;
version_tag_t version; // used for seamless cache reloads
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
future<> reload_all_permissions() noexcept;
future<> load_all();
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
@@ -64,31 +52,14 @@ public:
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
roles_map _roles;
// anonymous permissions map exists mainly due to compatibility with
// higher layers which use role_or_anonymous to get permissions.
std::unordered_map<resource, permission_set> _anonymous_permissions;
version_tag_t _current_version;
cql3::query_processor& _qp;
semaphore _loading_sem; // protects iteration of _roles map
semaphore _loading_sem;
abort_source& _as;
permission_loader_func _permission_loader;
semaphore _permission_loader_sem; // protects against reload storms on a single role change
metrics::metric_groups _metrics;
size_t _cached_permissions_count = 0;
future<lw_shared_ptr<role_record>> fetch_role(const role_name_t& role) const;
future<> prune_all() noexcept;
future<> distribute_role(const role_name_t& name, const lw_shared_ptr<role_record> role);
future<> gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name);
void add_role(const role_name_t& name, lw_shared_ptr<role_record> role);
void remove_role(const role_name_t& name);
void remove_role(roles_map::iterator it);
void clear_role_permissions(const role_name_t& name);
void add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms);
void remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r);
future<permission_set> load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache);
};
} // namespace auth

View File

@@ -88,16 +88,10 @@ static const class_registrator<
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
, _bind_password(bind_password)
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this))) {
}
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
@@ -106,8 +100,6 @@ ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_
qp.db().get_config().ldap_attr_role(),
qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(),
qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this] (const uint32_t& v) { _permissions_update_interval_in_ms = v; }),
qp,
rg0c,
mm,
@@ -127,22 +119,6 @@ future<> ldap_role_manager::start() {
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
}
_cache_pruner = futurize_invoke([this] () -> future<> {
while (true) {
try {
co_await seastar::sleep_abortable(std::chrono::milliseconds(_permissions_update_interval_in_ms), _as);
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
});
}
});
return _std_mgr.start();
}
@@ -199,11 +175,7 @@ future<conn_ptr> ldap_role_manager::reconnect() {
future<> ldap_role_manager::stop() {
_as.request_abort();
return std::move(_cache_pruner).then([this] {
return _std_mgr.stop();
}).then([this] {
return _connection_factory.stop();
});
return _std_mgr.stop().then([this] { return _connection_factory.stop(); });
}
future<> ldap_role_manager::create(std::string_view name, const role_config& config, ::service::group0_batch& mc) {

View File

@@ -10,7 +10,6 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <stdexcept>
#include "ent/ldap/ldap_connection.hh"
@@ -35,22 +34,14 @@ class ldap_role_manager : public role_manager {
seastar::sstring _target_attr; ///< LDAP entry attribute containing the Scylla role name.
seastar::sstring _bind_name; ///< Username for LDAP simple bind.
seastar::sstring _bind_password; ///< Password for LDAP simple bind.
uint32_t _permissions_update_interval_in_ms;
utils::observer<uint32_t> _permissions_update_interval_in_ms_observer;
mutable ldap_reuser _connection_factory; // Potentially modified by query_granted().
seastar::abort_source _as;
cache& _cache;
seastar::future<> _cache_pruner;
public:
ldap_role_manager(
std::string_view query_template, ///< LDAP query template as described in Scylla documentation.
std::string_view target_attr, ///< LDAP entry attribute containing the Scylla role name.
std::string_view bind_name, ///< LDAP bind credentials.
std::string_view bind_password, ///< LDAP bind credentials.
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ///< Passed to standard_role_manager.
::service::raft_group0_client& rg0c, ///< Passed to standard_role_manager.
::service::migration_manager& mm, ///< Passed to standard_role_manager.

38
auth/permissions_cache.cc Normal file
View File

@@ -0,0 +1,38 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/permissions_cache.hh"
#include <fmt/ranges.h>
#include "auth/authorizer.hh"
#include "auth/service.hh"
namespace auth {
permissions_cache::permissions_cache(const utils::loading_cache_config& c, service& ser, logging::logger& log)
: _cache(c, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
bool permissions_cache::update_config(utils::loading_cache_config c) {
return _cache.update_config(std::move(c));
}
void permissions_cache::reset() {
_cache.reset();
}
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);
});
}
}

66
auth/permissions_cache.hh Normal file
View File

@@ -0,0 +1,66 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <iostream>
#include <utility>
#include <fmt/core.h>
#include <seastar/core/future.hh>
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
#include "utils/log.hh"
#include "utils/hash.hh"
#include "utils/loading_cache.hh"
namespace std {
inline std::ostream& operator<<(std::ostream& os, const pair<auth::role_or_anonymous, auth::resource>& p) {
fmt::print(os, "{{role: {}, resource: {}}}", p.first, p.second);
return os;
}
}
namespace db {
class config;
}
namespace auth {
class service;
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<role_or_anonymous, resource>,
permission_set,
1,
utils::loading_cache_reload_enabled::yes,
utils::simple_entry_size<permission_set>,
utils::tuple_hash>;
using key_type = typename cache_type::key_type;
cache_type _cache;
public:
explicit permissions_cache(const utils::loading_cache_config&, service&, logging::logger&);
future <> stop() {
return _cache.stop();
}
bool update_config(utils::loading_cache_config);
void reset();
future<permission_set> get(const role_or_anonymous&, const resource&);
};
}

View File

@@ -64,11 +64,11 @@ static const sstring superuser_col_name("super");
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
service& _service;
authorizer& _authorizer;
cql3::query_processor& _qp;
public:
explicit auth_migration_listener(service& s, cql3::query_processor& qp) : _service(s), _qp(qp) {
explicit auth_migration_listener(authorizer& a, cql3::query_processor& qp) : _authorizer(a), _qp(qp) {
}
private:
@@ -92,14 +92,14 @@ private:
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
(void)do_with(::service::group0_batch::unused(), [this, &ks_name] (auto& mc) mutable {
return _authorizer.revoke_all(auth::make_data_resource(ks_name), mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped keyspace: {}", e);
});
(void)do_with(auth::make_functions_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
(void)do_with(::service::group0_batch::unused(), [this, &ks_name] (auto& mc) mutable {
return _authorizer.revoke_all(auth::make_functions_resource(ks_name), mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on functions in dropped keyspace: {}", e);
});
@@ -111,8 +111,9 @@ private:
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name, cf_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &cf_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_data_resource(ks_name, cf_name), mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped table: {}", e);
});
@@ -125,8 +126,9 @@ private:
return;
}
// Do it in the background.
(void)do_with(auth::make_functions_resource(ks_name, function_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &function_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_functions_resource(ks_name, function_name), mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped function: {}", e);
});
@@ -136,8 +138,9 @@ private:
// in non legacy path revoke is part of schema change statement execution
return;
}
(void)do_with(auth::make_functions_resource(ks_name, aggregate_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &aggregate_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_functions_resource(ks_name, aggregate_name), mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped aggregate: {}", e);
});
@@ -154,6 +157,7 @@ static future<> validate_role_exists(const service& ser, std::string_view role_n
}
service::service(
utils::loading_cache_config c,
cache& cache,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
@@ -162,17 +166,25 @@ service::service(
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r,
maintenance_socket_enabled used_by_maintenance_socket)
: _cache(cache)
: _loading_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _cache(cache)
, _qp(qp)
, _group0_client(g0)
, _mnotifier(mn)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*this, qp))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer, qp))
, _permissions_cache_cfg_cb([this] (uint32_t) { (void) _permissions_cache_config_action.trigger_later(); })
, _permissions_cache_config_action([this] { update_cache_config(); return make_ready_future<>(); })
, _permissions_cache_max_entries_observer(_qp.db().get_config().permissions_cache_max_entries.observe(_permissions_cache_cfg_cb))
, _permissions_cache_update_interval_in_ms_observer(_qp.db().get_config().permissions_update_interval_in_ms.observe(_permissions_cache_cfg_cb))
, _permissions_cache_validity_in_ms_observer(_qp.db().get_config().permissions_validity_in_ms.observe(_permissions_cache_cfg_cb))
, _used_by_maintenance_socket(used_by_maintenance_socket) {}
service::service(
utils::loading_cache_config c,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
::service::migration_notifier& mn,
@@ -181,6 +193,7 @@ service::service(
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache)
: service(
std::move(c),
cache,
qp,
g0,
@@ -244,14 +257,7 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
if (!_used_by_maintenance_socket) {
// Maintenance socket mode can't cache permissions because it has
// different authorizer. We can't mix cached permissions, they could be
// different in normal mode.
_cache.set_permission_loader(std::bind(
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
_permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);
co_await once_among_shards([this] {
_mnotifier.register_listener(_migration_listener.get());
return make_ready_future<>();
@@ -263,7 +269,9 @@ future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
return _mnotifier.unregister_listener(_migration_listener.get()).then([this] {
_cache.set_permission_loader(nullptr);
if (_permissions_cache) {
return _permissions_cache->stop();
}
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
@@ -275,8 +283,21 @@ future<> service::ensure_superuser_is_created() {
co_await _authenticator->ensure_superuser_is_created();
}
void service::update_cache_config() {
auto db = _qp.db();
utils::loading_cache_config perm_cache_config;
perm_cache_config.max_size = db.get_config().permissions_cache_max_entries();
perm_cache_config.expiry = std::chrono::milliseconds(db.get_config().permissions_validity_in_ms());
perm_cache_config.refresh = std::chrono::milliseconds(db.get_config().permissions_update_interval_in_ms());
if (!_permissions_cache->update_config(std::move(perm_cache_config))) {
log.error("Failed to apply permissions cache changes. Please read the documentation of these parameters");
}
}
void service::reset_authorization_cache() {
_permissions_cache->reset();
_qp.reset_cache();
}
@@ -301,10 +322,7 @@ service::get_uncached_permissions(const role_or_anonymous& maybe_role, const res
}
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (legacy_mode(_qp) || _used_by_maintenance_socket) {
return get_uncached_permissions(maybe_role, r);
}
return _cache.get_permissions(maybe_role, r);
return _permissions_cache->get(maybe_role, r);
}
future<bool> service::has_superuser(std::string_view role_name, const role_set& roles) const {
@@ -429,11 +447,6 @@ future<bool> service::exists(const resource& r) const {
return make_ready_future<bool>(false);
}
future<> service::revoke_all(const resource& r, ::service::group0_batch& mc) const {
co_await _authorizer->revoke_all(r, mc);
co_await _cache.prune(r);
}
future<std::vector<cql3::description>> service::describe_roles(bool with_hashed_passwords) {
std::vector<cql3::description> result{};
@@ -788,7 +801,7 @@ future<> revoke_permissions(
}
future<> revoke_all(const service& ser, const resource& r, ::service::group0_batch& mc) {
return ser.revoke_all(r, mc);
return ser.underlying_authorizer().revoke_all(r, mc);
}
future<std::vector<permission_details>> list_filtered_permissions(

View File

@@ -20,6 +20,7 @@
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/permission.hh"
#include "auth/permissions_cache.hh"
#include "auth/cache.hh"
#include "auth/role_manager.hh"
#include "auth/common.hh"
@@ -74,6 +75,8 @@ public:
/// peering_sharded_service inheritance is needed to be able to access shard local authentication service
/// given an object from another shard. Used for bouncing lwt requests to correct shard.
class service final : public seastar::peering_sharded_service<service> {
utils::loading_cache_config _loading_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cache& _cache;
cql3::query_processor& _qp;
@@ -91,12 +94,20 @@ class service final : public seastar::peering_sharded_service<service> {
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
std::function<void(uint32_t)> _permissions_cache_cfg_cb;
serialized_action _permissions_cache_config_action;
utils::observer<uint32_t> _permissions_cache_max_entries_observer;
utils::observer<uint32_t> _permissions_cache_update_interval_in_ms_observer;
utils::observer<uint32_t> _permissions_cache_validity_in_ms_observer;
maintenance_socket_enabled _used_by_maintenance_socket;
abort_source _as;
public:
service(
utils::loading_cache_config,
cache& cache,
cql3::query_processor&,
::service::raft_group0_client&,
@@ -112,6 +123,7 @@ public:
/// of the instances themselves.
///
service(
utils::loading_cache_config,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
@@ -126,6 +138,8 @@ public:
future<> ensure_superuser_is_created();
void update_cache_config();
void reset_authorization_cache();
///
@@ -167,13 +181,6 @@ public:
future<bool> exists(const resource&) const;
///
/// Revoke all permissions granted to any role for a particular resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
future<> revoke_all(const resource&, ::service::group0_batch&) const;
///
/// Produces descriptions that can be used to restore the state of auth. That encompasses
/// roles, role grants, and permission grants.

View File

@@ -52,6 +52,13 @@ static const class_registrator<
::service::migration_manager&,
cache&> registration("org.apache.cassandra.auth.CassandraRoleManager");
struct record final {
sstring name;
bool is_superuser;
bool can_login;
role_set member_of;
};
static db::consistency_level consistency_for_role(std::string_view role_name) noexcept {
if (role_name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
@@ -60,13 +67,13 @@ static db::consistency_level consistency_for_role(std::string_view role_name) no
return db::consistency_level::LOCAL_ONE;
}
future<std::optional<standard_role_manager::record>> standard_role_manager::legacy_find_record(std::string_view role_name) {
static future<std::optional<record>> find_record(cql3::query_processor& qp, std::string_view role_name) {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
get_auth_ks_name(qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
const auto results = co_await _qp.execute_internal(
const auto results = co_await qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
@@ -86,25 +93,8 @@ future<std::optional<standard_role_manager::record>> standard_role_manager::lega
: role_set())});
}
future<std::optional<standard_role_manager::record>> standard_role_manager::find_record(std::string_view role_name) {
if (legacy_mode(_qp)) {
return legacy_find_record(role_name);
}
auto name = sstring(role_name);
auto role = _cache.get(name);
if (!role) {
return make_ready_future<std::optional<record>>(std::nullopt);
}
return make_ready_future<std::optional<record>>(std::make_optional(record{
.name = std::move(name),
.is_superuser = role->is_superuser,
.can_login = role->can_login,
.member_of = role->member_of
}));
}
future<standard_role_manager::record> standard_role_manager::require_record(std::string_view role_name) {
return find_record(role_name).then([role_name](std::optional<record> mr) {
static future<record> require_record(cql3::query_processor& qp, std::string_view role_name) {
return find_record(qp, role_name).then([role_name](std::optional<record> mr) {
if (!mr) {
throw nonexistant_role(role_name);
}
@@ -396,7 +386,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
return fmt::to_string(fmt::join(assignments, ", "));
};
return require_record(role_name).then([this, role_name, &u, &mc](record) {
return require_record(_qp, role_name).then([this, role_name, &u, &mc](record) {
if (!u.is_superuser && !u.can_login) {
return make_ready_future<>();
}
@@ -630,17 +620,18 @@ standard_role_manager::revoke(std::string_view revokee_name, std::string_view ro
});
}
future<> standard_role_manager::collect_roles(
static future<> collect_roles(
cql3::query_processor& qp,
std::string_view grantee_name,
bool recurse,
role_set& roles) {
return require_record(grantee_name).then([this, &roles, recurse](standard_role_manager::record r) {
return do_with(std::move(r.member_of), [this, &roles, recurse](const role_set& memberships) {
return do_for_each(memberships.begin(), memberships.end(), [this, &roles, recurse](const sstring& role_name) {
return require_record(qp, grantee_name).then([&qp, &roles, recurse](record r) {
return do_with(std::move(r.member_of), [&qp, &roles, recurse](const role_set& memberships) {
return do_for_each(memberships.begin(), memberships.end(), [&qp, &roles, recurse](const sstring& role_name) {
roles.insert(role_name);
if (recurse) {
return collect_roles(role_name, true, roles);
return collect_roles(qp, role_name, true, roles);
}
return make_ready_future<>();
@@ -655,7 +646,7 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
return do_with(
role_set{sstring(grantee_name)},
[this, grantee_name, recurse](role_set& roles) {
return collect_roles(grantee_name, recurse, roles).then([&roles] { return roles; });
return collect_roles(_qp, grantee_name, recurse, roles).then([&roles] { return roles; });
});
}
@@ -715,21 +706,27 @@ future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
}
future<bool> standard_role_manager::exists(std::string_view role_name) {
return find_record(role_name).then([](std::optional<record> mr) {
return find_record(_qp, role_name).then([](std::optional<record> mr) {
return static_cast<bool>(mr);
});
}
future<bool> standard_role_manager::is_superuser(std::string_view role_name) {
return require_record(role_name).then([](record r) {
return require_record(_qp, role_name).then([](record r) {
return r.is_superuser;
});
}
future<bool> standard_role_manager::can_login(std::string_view role_name) {
return require_record(role_name).then([](record r) {
return r.can_login;
});
if (legacy_mode(_qp)) {
const auto r = co_await require_record(_qp, role_name);
co_return r.can_login;
}
auto role = _cache.get(sstring(role_name));
if (!role) {
throw nonexistant_role(role_name);
}
co_return role->can_login;
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {

View File

@@ -90,12 +90,6 @@ public:
private:
enum class membership_change { add, remove };
struct record final {
sstring name;
bool is_superuser;
bool can_login;
role_set member_of;
};
future<> create_legacy_metadata_tables_if_missing() const;
@@ -113,14 +107,6 @@ private:
future<> legacy_modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change);
future<> modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change, ::service::group0_batch& mc);
future<std::optional<record>> legacy_find_record(std::string_view role_name);
future<std::optional<record>> find_record(std::string_view role_name);
future<record> require_record(std::string_view role_name);
future<> collect_roles(
std::string_view grantee_name,
bool recurse,
role_set& roles);
};
} // namespace auth

View File

@@ -814,7 +814,8 @@ generation_service::generation_service(
config cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm, gms::feature_service& f,
replica::database& db)
replica::database& db,
std::function<bool()> raft_topology_change_enabled)
: _cfg(std::move(cfg))
, _gossiper(g)
, _sys_dist_ks(sys_dist_ks)
@@ -823,6 +824,7 @@ generation_service::generation_service(
, _token_metadata(stm)
, _feature_service(f)
, _db(db)
, _raft_topology_change_enabled(std::move(raft_topology_change_enabled))
{
}
@@ -876,7 +878,16 @@ future<> generation_service::on_join(gms::inet_address ep, locator::host_id id,
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
return make_ready_future<>();
if (_raft_topology_change_enabled()) {
return make_ready_future<>();
}
return on_application_state_change(ep, id, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, locator::host_id id, const gms::versioned_value& v, gms::permit_id) {
auto gen_id = gms::versioned_value::cdc_generation_id_from_string(v.value());
cdc_log.debug("Endpoint: {}, CDC generation ID change: {}", ep, gen_id);
return legacy_handle_cdc_generation(gen_id);
});
}
future<> generation_service::check_and_repair_cdc_streams() {

View File

@@ -79,12 +79,17 @@ private:
std::optional<cdc::generation_id> _gen_id;
future<> _cdc_streams_rewrite_complete = make_ready_future<>();
/* Returns true if raft topology changes are enabled.
* Can only be called from shard 0.
*/
std::function<bool()> _raft_topology_change_enabled;
public:
generation_service(config cfg, gms::gossiper&,
sharded<db::system_distributed_keyspace>&,
sharded<db::system_keyspace>& sys_ks,
abort_source&, const locator::shared_token_metadata&,
gms::feature_service&, replica::database& db);
gms::feature_service&, replica::database& db,
std::function<bool()> raft_topology_change_enabled);
future<> stop();
~generation_service();

View File

@@ -48,6 +48,7 @@
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/chunked_vector.hh"
#include "utils/pretty_printers.hh"
#include "readers/multi_range.hh"
#include "readers/compacting.hh"
@@ -611,23 +612,23 @@ private:
}
// Called in a seastar thread
dht::partition_range_vector
utils::chunked_vector<dht::partition_range>
get_ranges_for_invalidation(const std::vector<sstables::shared_sstable>& sstables) {
// If owned ranges is disengaged, it means no cleanup work was done and
// so nothing needs to be invalidated.
if (!_owned_ranges) {
return dht::partition_range_vector{};
return {};
}
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto owned_ranges = dht::to_partition_ranges_chunked(*_owned_ranges).get();
auto non_owned_ranges = sstables
| std::views::transform([] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}) | std::ranges::to<dht::partition_range_vector>();
}) | std::ranges::to<utils::chunked_vector<dht::partition_range>>();
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
return dht::subtract_ranges(*_schema, std::move(non_owned_ranges), std::move(owned_ranges)).get();
}
protected:
compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
@@ -718,8 +719,8 @@ protected:
compaction_completion_desc
get_compaction_completion_desc(std::vector<sstables::shared_sstable> input_sstables, std::vector<sstables::shared_sstable> output_sstables) {
auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
auto ranges = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges)};
}
// Tombstone expiration is enabled based on the presence of sstable set.

View File

@@ -16,6 +16,7 @@
#include "sstables/sstable_set.hh"
#include "compaction_fwd.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "utils/chunked_vector.hh"
namespace compaction {
@@ -38,7 +39,7 @@ struct compaction_completion_desc {
// New, fresh SSTables that should be added to SSTable set, replacing the old ones.
std::vector<sstables::shared_sstable> new_sstables;
// Set of compacted partition ranges that should be invalidated in the cache.
dht::partition_range_vector ranges_for_cache_invalidation;
utils::chunked_vector<dht::partition_range> ranges_for_cache_invalidation;
};
// creates a new SSTable for a given shard

View File

@@ -778,6 +778,7 @@ compaction_manager::get_incremental_repair_read_lock(compaction::compaction_grou
cmlog.debug("Get get_incremental_repair_read_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_read_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_read_lock for {} done", reason);
@@ -791,6 +792,7 @@ compaction_manager::get_incremental_repair_write_lock(compaction::compaction_gro
cmlog.debug("Get get_incremental_repair_write_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_write_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_write_lock for {} done", reason);
@@ -1519,8 +1521,8 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
| std::views::transform(std::mem_fn(&sstables::sstable::run_identifier))
| std::ranges::to<std::unordered_set>());
};
const auto injected_threshold = utils::get_local_injector().inject_parameter<size_t>("set_sstable_count_reduction_threshold");
const auto threshold = injected_threshold.value_or(size_t(std::max(schema->max_compaction_threshold(), 32)));
const auto threshold = utils::get_local_injector().inject_parameter<size_t>("set_sstable_count_reduction_threshold")
.value_or(size_t(std::max(schema->max_compaction_threshold(), 32)));
auto count = co_await num_runs_for_compaction();
if (count <= threshold) {
@@ -1536,7 +1538,7 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
auto& cstate = get_compaction_state(&t);
try {
while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.when();
co_await cstate.compaction_done.wait();
}
} catch (const broken_condition_variable&) {
co_return;
@@ -2387,6 +2389,8 @@ future<> compaction_manager::remove(compaction_group_view& t, sstring reason) no
if (!c_state.gate.is_closed()) {
auto close_gate = c_state.gate.close();
co_await stop_ongoing_compactions(reason, &t);
// Wait for users of incremental repair lock (can be either repair itself or maintenance compactions).
co_await c_state.incremental_repair_lock.write_lock();
co_await std::move(close_gate);
}

View File

@@ -299,11 +299,13 @@ batch_size_fail_threshold_in_kb: 1024
# max_hint_window_in_ms: 10800000 # 3 hours
# Validity period for authorized statements cache. Defaults to 10000, set to 0 to disable.
# Validity period for permissions cache (fetching permissions can be an
# expensive operation depending on the authorizer, CassandraAuthorizer is
# one example). Defaults to 10000, set to 0 to disable.
# Will be disabled automatically for AllowAllAuthorizer.
# permissions_validity_in_ms: 10000
# Refresh interval for authorized statements cache.
# Refresh interval for permissions cache (if enabled).
# After this interval, cache entries become eligible for refresh. Upon next
# access, an async reload is scheduled and the old value returned until it
# completes. If permissions_validity_in_ms is non-zero, then this also must have
@@ -564,16 +566,15 @@ commitlog_total_space_in_mb: -1
# prometheus_address: 1.2.3.4
# audit settings
# Table audit is enabled by default.
# By default, Scylla does not audit anything.
# 'audit' config option controls if and where to output audited events:
# - "none": auditing is disabled
# - "table": save audited events in audit.audit_log column family (default)
# - "none": auditing is disabled (default)
# - "table": save audited events in audit.audit_log column family
# - "syslog": send audited events via syslog (depends on OS, but usually to /dev/log)
audit: "table"
#
# List of statement categories that should be audited.
# Possible categories are: QUERY, DML, DCL, DDL, AUTH, ADMIN
audit_categories: "DCL,AUTH,ADMIN"
audit_categories: "DCL,DDL,AUTH,ADMIN"
#
# List of tables that should be audited.
# audit_tables: "<keyspace_name>.<table_name>,<keyspace_name>.<table_name>"
@@ -874,16 +875,7 @@ maintenance_socket: ignore
# The `tablets` option cannot be changed using `ALTER KEYSPACE`.
tablets_mode_for_new_keyspaces: enabled
# Require every tablet-enabled keyspace to be RF-rack-valid.
#
# A tablet-enabled keyspace is RF-rack-valid when, for each data center,
# its replication factor (RF) is 0, 1, or exactly equal to the number of
# racks in that data center. Setting the RF to the number of racks ensures
# that a single rack failure never results in data unavailability.
#
# When set to true, CREATE KEYSPACE and ALTER KEYSPACE statements that
# would produce an RF-rack-invalid keyspace are rejected.
# When set to false, such statements are allowed but emit a warning.
# Enforce RF-rack-valid keyspaces.
rf_rack_valid_keyspaces: false
#

View File

@@ -795,9 +795,6 @@ arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='clan
help='C compiler path')
arg_parser.add_argument('--compiler-cache', action='store', dest='compiler_cache', default='auto',
help='Compiler cache to use: auto (default, prefers sccache), sccache, ccache, none, or a path to a binary')
# Workaround for https://github.com/mozilla/sccache/issues/2575
arg_parser.add_argument('--sccache-rust', action=argparse.BooleanOptionalAction, default=False,
help='Use sccache for rust code (if sccache is selected as compiler cache). Doesn\'t work with distributed builds.')
add_tristate(arg_parser, name='dpdk', dest='dpdk', default=False,
help='Use dpdk (from seastar dpdk sources)')
arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', default='',
@@ -928,7 +925,8 @@ scylla_core = (['message/messaging_service.cc',
'utils/crypt_sha512.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'test/lib/limiting_data_source.cc',
'utils/buffer_input_stream.cc',
'utils/limiting_data_source.cc',
'utils/updateable_value.cc',
'message/dictionary_service.cc',
'utils/directories.cc',
@@ -1174,7 +1172,6 @@ scylla_core = (['message/messaging_service.cc',
'utils/gz/crc_combine.cc',
'utils/gz/crc_combine_table.cc',
'utils/http.cc',
'utils/http_client_error_processing.cc',
'utils/rest/client.cc',
'utils/s3/aws_error.cc',
'utils/s3/client.cc',
@@ -1192,7 +1189,6 @@ scylla_core = (['message/messaging_service.cc',
'utils/azure/identity/default_credentials.cc',
'utils/gcp/gcp_credentials.cc',
'utils/gcp/object_storage.cc',
'utils/gcp/object_storage_retry_strategy.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -1277,6 +1273,7 @@ scylla_core = (['message/messaging_service.cc',
'auth/passwords.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/permissions_cache.cc',
'auth/service.cc',
'auth/standard_role_manager.cc',
'auth/ldap_role_manager.cc',
@@ -1361,6 +1358,7 @@ scylla_core = (['message/messaging_service.cc',
'service/topology_state_machine.cc',
'service/topology_mutation.cc',
'service/topology_coordinator.cc',
'node_ops/node_ops_ctl.cc',
'node_ops/task_manager_module.cc',
'reader_concurrency_semaphore_group.cc',
'utils/disk_space_monitor.cc',
@@ -1537,7 +1535,6 @@ scylla_perfs = ['test/perf/perf_alternator.cc',
'test/perf/perf_fast_forward.cc',
'test/perf/perf_row_cache_update.cc',
'test/perf/perf_simple_query.cc',
'test/perf/perf_cql_raw.cc',
'test/perf/perf_sstable.cc',
'test/perf/perf_tablets.cc',
'test/perf/tablet_load_balancing.cc',
@@ -1645,7 +1642,6 @@ for t in sorted(perf_tests):
deps['test/boost/combined_tests'] += [
'test/boost/aggregate_fcts_test.cc',
'test/boost/auth_cache_test.cc',
'test/boost/auth_test.cc',
'test/boost/batchlog_manager_test.cc',
'test/boost/cache_algorithm_test.cc',
@@ -2387,7 +2383,7 @@ def write_build_file(f,
# If compiler cache is available, prefix the compiler with it
cxx_with_cache = f'{compiler_cache} {args.cxx}' if compiler_cache else args.cxx
# For Rust, sccache is used via RUSTC_WRAPPER environment variable
rustc_wrapper = f'RUSTC_WRAPPER={compiler_cache} ' if compiler_cache and 'sccache' in compiler_cache and args.sccache_rust else ''
rustc_wrapper = f'RUSTC_WRAPPER={compiler_cache} ' if compiler_cache and 'sccache' in compiler_cache else ''
f.write(textwrap.dedent('''\
configure_args = {configure_args}
builddir = {outdir}
@@ -3116,7 +3112,7 @@ def configure_using_cmake(args):
settings['CMAKE_CXX_COMPILER_LAUNCHER'] = compiler_cache
settings['CMAKE_C_COMPILER_LAUNCHER'] = compiler_cache
# For Rust, sccache is used via RUSTC_WRAPPER
if 'sccache' in compiler_cache and args.sccache_rust:
if 'sccache' in compiler_cache:
settings['Scylla_RUSTC_WRAPPER'] = compiler_cache
if args.date_stamp:

View File

@@ -389,10 +389,8 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
bool is_ann_ordering = false;
}
: K_SELECT (
( (K_JSON K_DISTINCT)=> K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; }
| (K_JSON selectClause K_FROM)=> K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; }
)?
( (K_DISTINCT selectClause K_FROM)=> K_DISTINCT { is_distinct = true; } )?
( K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; } )?
( K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
)
K_FROM (
@@ -427,13 +425,13 @@ selector returns [shared_ptr<raw_selector> s]
unaliasedSelector returns [uexpression tmp]
: ( c=cident { tmp = unresolved_identifier{std::move(c)}; }
| v=value { tmp = std::move(v); }
| K_COUNT '(' countArgument ')' { tmp = make_count_rows_function_expression(); }
| K_WRITETIME '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::writetime,
unresolved_identifier{std::move(c)}}; }
| K_TTL '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::ttl,
unresolved_identifier{std::move(c)}}; }
| f=functionName args=selectionFunctionArgs { tmp = function_call{std::move(f), std::move(args)}; }
| f=similarityFunctionName args=vectorSimilarityArgs { tmp = function_call{std::move(f), std::move(args)}; }
| K_CAST '(' arg=unaliasedSelector K_AS t=native_type ')' { tmp = cast{.style = cast::cast_style::sql, .arg = std::move(arg), .type = std::move(t)}; }
)
( '.' fi=cident { tmp = field_selection{std::move(tmp), std::move(fi)}; }
@@ -448,9 +446,23 @@ selectionFunctionArgs returns [std::vector<expression> a]
')'
;
vectorSimilarityArgs returns [std::vector<expression> a]
: '(' ')'
| '(' v1=vectorSimilarityArg { a.push_back(std::move(v1)); }
( ',' vn=vectorSimilarityArg { a.push_back(std::move(vn)); } )*
')'
;
vectorSimilarityArg returns [uexpression a]
: s=unaliasedSelector { a = std::move(s); }
| v=value { a = std::move(v); }
;
countArgument
: '*'
/* COUNT(1) is also allowed, it is recognized via the general function(args) path */
| i=INTEGER { if (i->getText() != "1") {
add_recognition_error("Only COUNT(1) is supported, got COUNT(" + i->getText() + ")");
} }
;
whereClause returns [uexpression clause]
@@ -874,8 +886,8 @@ cfamDefinition[cql3::statements::create_table_statement::raw_statement& expr]
;
cfamColumns[cql3::statements::create_table_statement::raw_statement& expr]
@init { bool is_static=false, is_ttl=false; }
: k=ident v=comparatorType (K_TTL {is_ttl = true;})? (K_STATIC {is_static = true;})? { $expr.add_definition(k, v, is_static, is_ttl); }
@init { bool is_static=false; }
: k=ident v=comparatorType (K_STATIC {is_static = true;})? { $expr.add_definition(k, v, is_static); }
(K_PRIMARY K_KEY { $expr.add_key_aliases(std::vector<shared_ptr<cql3::column_identifier>>{k}); })?
| K_PRIMARY K_KEY '(' pkDef[expr] (',' c=ident { $expr.add_column_alias(c); } )* ')'
;
@@ -1042,7 +1054,6 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement::raw_statemen
std::vector<alter_table_statement::column_change> column_changes;
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
auto attrs = std::make_unique<cql3::attributes::raw>();
shared_ptr<cql3::column_identifier::raw> ttl_change;
}
: K_ALTER K_COLUMNFAMILY cf=columnFamilyName
( K_ALTER id=cident K_TYPE v=comparatorType { type = alter_table_statement::type::alter; column_changes.emplace_back(alter_table_statement::column_change{id, v}); }
@@ -1061,11 +1072,9 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement::raw_statemen
| K_RENAME { type = alter_table_statement::type::rename; }
id1=cident K_TO toId1=cident { renames.emplace_back(id1, toId1); }
( K_AND idn=cident K_TO toIdn=cident { renames.emplace_back(idn, toIdn); } )*
| K_TTL { type = alter_table_statement::type::ttl; }
( id=cident { ttl_change = id; } | K_NULL )
)
{
$expr = std::make_unique<alter_table_statement::raw_statement>(std::move(cf), type, std::move(column_changes), std::move(props), std::move(renames), std::move(attrs), std::move(ttl_change));
$expr = std::make_unique<alter_table_statement::raw_statement>(std::move(cf), type, std::move(column_changes), std::move(props), std::move(renames), std::move(attrs));
}
;
@@ -1697,6 +1706,10 @@ functionName returns [cql3::functions::function_name s]
: (ks=keyspaceName '.')? f=allowedFunctionName { $s.keyspace = std::move(ks); $s.name = std::move(f); }
;
similarityFunctionName returns [cql3::functions::function_name s]
: f=allowedSimilarityFunctionName { $s = cql3::functions::function_name::native_function(std::move(f)); }
;
allowedFunctionName returns [sstring s]
: f=IDENT { $s = $f.text; std::transform(s.begin(), s.end(), s.begin(), ::tolower); }
| f=QUOTED_NAME { $s = $f.text; }
@@ -1705,6 +1718,11 @@ allowedFunctionName returns [sstring s]
| K_COUNT { $s = "count"; }
;
allowedSimilarityFunctionName returns [sstring s]
: f=(K_SIMILARITY_COSINE | K_SIMILARITY_EUCLIDEAN | K_SIMILARITY_DOT_PRODUCT)
{ $s = $f.text; std::transform(s.begin(), s.end(), s.begin(), ::tolower); }
;
functionArgs returns [std::vector<expression> a]
: '(' ')'
| '(' t1=term { a.push_back(std::move(t1)); }
@@ -2074,21 +2092,7 @@ vector_type returns [shared_ptr<cql3::cql3_type::raw> pt]
{
if ($d.text[0] == '-')
throw exceptions::invalid_request_exception("Vectors must have a dimension greater than 0");
unsigned long parsed_dimension;
try {
parsed_dimension = std::stoul($d.text);
} catch (const std::exception& e) {
throw exceptions::invalid_request_exception(format("Invalid vector dimension: {}", $d.text));
}
static_assert(sizeof(unsigned long) >= sizeof(vector_dimension_t));
if (parsed_dimension == 0) {
throw exceptions::invalid_request_exception("Vectors must have a dimension greater than 0");
}
if (parsed_dimension > cql3::cql3_type::MAX_VECTOR_DIMENSION) {
throw exceptions::invalid_request_exception(
format("Vectors must have a dimension less than or equal to {}", cql3::cql3_type::MAX_VECTOR_DIMENSION));
}
$pt = cql3::cql3_type::raw::vector(t, static_cast<vector_dimension_t>(parsed_dimension));
$pt = cql3::cql3_type::raw::vector(t, std::stoul($d.text));
}
;
@@ -2415,6 +2419,10 @@ K_MUTATION_FRAGMENTS: M U T A T I O N '_' F R A G M E N T S;
K_VECTOR_SEARCH_INDEXING: V E C T O R '_' S E A R C H '_' I N D E X I N G;
K_SIMILARITY_EUCLIDEAN: S I M I L A R I T Y '_' E U C L I D E A N;
K_SIMILARITY_COSINE: S I M I L A R I T Y '_' C O S I N E;
K_SIMILARITY_DOT_PRODUCT: S I M I L A R I T Y '_' D O T '_' P R O D U C T;
// Case-insensitive alpha characters
fragment A: ('a'|'A');
fragment B: ('b'|'B');

View File

@@ -27,7 +27,7 @@ public:
struct vector_test_result {
test_result result;
std::optional<vector_dimension_t> dimension_opt;
std::optional<size_t> dimension_opt;
};
static bool is_assignable(test_result tr) {

View File

@@ -307,14 +307,17 @@ public:
class cql3_type::raw_vector : public raw {
shared_ptr<raw> _type;
vector_dimension_t _dimension;
size_t _dimension;
// This limitation is acquired from the maximum number of dimensions in OpenSearch.
static constexpr size_t MAX_VECTOR_DIMENSION = 16000;
virtual sstring to_string() const override {
return seastar::format("vector<{}, {}>", _type, _dimension);
}
public:
raw_vector(shared_ptr<raw> type, vector_dimension_t dimension)
raw_vector(shared_ptr<raw> type, size_t dimension)
: _type(std::move(type)), _dimension(dimension) {
}
@@ -414,7 +417,7 @@ cql3_type::raw::tuple(std::vector<shared_ptr<raw>> ts) {
}
shared_ptr<cql3_type::raw>
cql3_type::raw::vector(shared_ptr<raw> t, vector_dimension_t dimension) {
cql3_type::raw::vector(shared_ptr<raw> t, size_t dimension) {
return ::make_shared<raw_vector>(std::move(t), dimension);
}

View File

@@ -39,9 +39,6 @@ public:
data_type get_type() const { return _type; }
const sstring& to_string() const { return _type->cql3_type_name(); }
// This limitation is acquired from the maximum number of dimensions in OpenSearch.
static constexpr vector_dimension_t MAX_VECTOR_DIMENSION = 16000;
// For UserTypes, we need to know the current keyspace to resolve the
// actual type used, so Raw is a "not yet prepared" CQL3Type.
class raw {
@@ -67,7 +64,7 @@ public:
static shared_ptr<raw> list(shared_ptr<raw> t);
static shared_ptr<raw> set(shared_ptr<raw> t);
static shared_ptr<raw> tuple(std::vector<shared_ptr<raw>> ts);
static shared_ptr<raw> vector(shared_ptr<raw> t, vector_dimension_t dimension);
static shared_ptr<raw> vector(shared_ptr<raw> t, size_t dimension);
static shared_ptr<raw> frozen(shared_ptr<raw> t);
friend sstring format_as(const raw& r) {
return r.to_string();

View File

@@ -10,7 +10,6 @@
#include "expr-utils.hh"
#include "evaluate.hh"
#include "cql3/functions/functions.hh"
#include "cql3/functions/aggregate_fcts.hh"
#include "cql3/functions/castas_fcts.hh"
#include "cql3/functions/scalar_function.hh"
#include "cql3/column_identifier.hh"
@@ -502,8 +501,8 @@ vector_validate_assignable_to(const collection_constructor& c, data_dictionary::
throw exceptions::invalid_request_exception(format("Invalid vector type literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
vector_dimension_t expected_size = vt->get_dimension();
if (expected_size == 0) {
size_t expected_size = vt->get_dimension();
if (!expected_size) {
throw exceptions::invalid_request_exception(format("Invalid vector type literal for {}: type {} expects at least one element",
*receiver.name, receiver.type->as_cql3_type()));
}
@@ -1048,47 +1047,8 @@ prepare_function_args_for_type_inference(std::span<const expression> args, data_
return partially_prepared_args;
}
// Special case for count(1) - recognize it as the countRows() function. Note it is quite
// artificial and we might relax it to the more general count(expression) later.
static
std::optional<expression>
try_prepare_count_rows(const expr::function_call& fc, data_dictionary::database db, const sstring& keyspace, const schema* schema_opt, lw_shared_ptr<column_specification> receiver) {
return std::visit(overloaded_functor{
[&] (const functions::function_name& name) -> std::optional<expression> {
auto native_name = name;
if (!native_name.has_keyspace()) {
native_name = name.as_native_function();
}
// Collapse count(1) into countRows()
if (native_name == functions::function_name::native_function("count")) {
if (fc.args.size() == 1) {
if (auto uc_arg = expr::as_if<expr::untyped_constant>(&fc.args[0])) {
if (uc_arg->partial_type == expr::untyped_constant::type_class::integer
&& uc_arg->raw_text == "1") {
return expr::function_call{
.func = functions::aggregate_fcts::make_count_rows_function(),
.args = {},
};
} else {
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument", fc.args[0]));
}
}
}
}
return std::nullopt;
},
[] (const shared_ptr<functions::function>&) -> std::optional<expression> {
// Already prepared, nothing to do
return std::nullopt;
},
}, fc.func);
}
std::optional<expression>
prepare_function_call(const expr::function_call& fc, data_dictionary::database db, const sstring& keyspace, const schema* schema_opt, lw_shared_ptr<column_specification> receiver) {
if (auto prepared = try_prepare_count_rows(fc, db, keyspace, schema_opt, receiver)) {
return prepared;
}
// Try to extract a column family name from the available information.
// Most functions can be prepared without information about the column family, usually just the keyspace is enough.
// One exception is the token() function - in order to prepare system.token() we have to know the partition key of the table,

View File

@@ -18,7 +18,7 @@ namespace functions {
namespace detail {
std::vector<float> extract_float_vector(const bytes_opt& param, vector_dimension_t dimension) {
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension) {
if (!param) {
throw exceptions::invalid_request_exception("Cannot extract float vector from null parameter");
}
@@ -156,7 +156,7 @@ std::vector<data_type> retrieve_vector_arg_types(const function_name& name, cons
}
}
vector_dimension_t dimension = first_dim_opt ? *first_dim_opt : *second_dim_opt;
size_t dimension = first_dim_opt ? *first_dim_opt : *second_dim_opt;
auto type = vector_type_impl::get_instance(float_type, dimension);
return {type, type};
}
@@ -170,7 +170,7 @@ bytes_opt vector_similarity_fct::execute(std::span<const bytes_opt> parameters)
// Extract dimension from the vector type
const auto& type = static_cast<const vector_type_impl&>(*arg_types()[0]);
vector_dimension_t dimension = type.get_dimension();
size_t dimension = type.get_dimension();
// Optimized path: extract floats directly from bytes, bypassing data_value overhead
std::vector<float> v1 = detail::extract_float_vector(parameters[0], dimension);

View File

@@ -39,7 +39,7 @@ namespace detail {
// Extract float vector directly from serialized bytes, bypassing data_value overhead.
// This is an internal API exposed for testing purposes.
// Vector<float, N> wire format: N floats as big-endian uint32_t values, 4 bytes each.
std::vector<float> extract_float_vector(const bytes_opt& param, vector_dimension_t dimension);
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension);
} // namespace detail

View File

@@ -105,6 +105,7 @@ public:
static const std::chrono::minutes entry_expiry;
using key_type = prepared_cache_key_type;
using pinned_value_type = cache_value_ptr;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,9 +117,14 @@ public:
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
future<pinned_value_type> get_pinned(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); });
}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return get_pinned(key, std::forward<LoadFunc>(load)).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}

View File

@@ -697,7 +697,7 @@ future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
try {
auto key = compute_id(query_string, client_state.get_raw_keyspace(), d);
auto prep_ptr = co_await _prepared_cache.get(key, [this, &query_string, &client_state, d] {
auto prep_entry = co_await _prepared_cache.get_pinned(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
@@ -711,13 +711,13 @@ query_processor::prepare(sstring query_string, const service::client_state& clie
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
const auto& warnings = prep_ptr->warnings;
const auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_ptr),
co_await utils::get_local_injector().inject(
"query_processor_prepare_wait_after_cache_get",
utils::wait_for_message(std::chrono::seconds(60)));
auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_entry),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
for (const auto& w : warnings) {
msg->add_warning(w);
}
co_return ::shared_ptr<cql_transport::messages::result_message::prepared>(std::move(msg));
co_return std::move(msg);
} catch(typename prepared_statements_cache::statement_is_too_big&) {
throw prepared_statement_is_too_big(query_string);
}

View File

@@ -10,7 +10,6 @@
#include "cdc/log.hh"
#include "index/vector_index.hh"
#include "types/types.hh"
#include "utils/assert.hh"
#include <seastar/core/coroutine.hh>
#include "cql3/query_options.hh"
@@ -31,9 +30,6 @@
#include "cql3/query_processor.hh"
#include "cdc/cdc_extension.hh"
#include "cdc/cdc_partitioner.hh"
#include "db/tags/extension.hh"
#include "db/tags/utils.hh"
#include "alternator/ttl_tag.hh"
namespace cql3 {
@@ -47,8 +43,7 @@ alter_table_statement::alter_table_statement(uint32_t bound_terms,
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
renames_type renames,
std::unique_ptr<attributes> attrs,
shared_ptr<column_identifier::raw> ttl_change)
std::unique_ptr<attributes> attrs)
: schema_altering_statement(std::move(name))
, _bound_terms(bound_terms)
, _type(t)
@@ -56,7 +51,6 @@ alter_table_statement::alter_table_statement(uint32_t bound_terms,
, _properties(std::move(properties))
, _renames(std::move(renames))
, _attrs(std::move(attrs))
, _ttl_change(std::move(ttl_change))
{
}
@@ -386,21 +380,6 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
throw exceptions::invalid_request_exception("Cannot drop columns from a non-CQL3 table");
}
invoke_column_change_fn(std::mem_fn(&alter_table_statement::drop_column));
// If we dropped the column used for per-row TTL, we need to remove the tag.
if (std::optional<std::string> ttl_column = db::find_tag(*s, TTL_TAG_KEY)) {
for (auto& [raw_name, raw_validator, is_static] : _column_changes) {
if (*ttl_column == raw_name->text()) {
const std::map<sstring, sstring>* tags_ptr = db::get_tags_of_table(s);
if (tags_ptr) {
std::map<sstring, sstring> tags_map = *tags_ptr;
tags_map.erase(TTL_TAG_KEY);
cfm.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(std::move(tags_map)));
}
break;
}
}
}
break;
case alter_table_statement::type::opts:
@@ -455,7 +434,6 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
break;
case alter_table_statement::type::rename:
{
for (auto&& entry : _renames) {
auto from = entry.first->prepare_column_identifier(*s);
auto to = entry.second->prepare_column_identifier(*s);
@@ -492,53 +470,6 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
}
return make_pair(std::move(new_base_schema), std::move(view_updates));
}
case alter_table_statement::type::ttl:
if (!db.features().cql_row_ttl) {
throw exceptions::invalid_request_exception("The CQL per-row TTL feature is not yet supported by this cluster. Upgrade all nodes to use it.");
}
if (_ttl_change) {
// Enable per-row TTL with chosen column for expiration time
const column_definition *cdef =
s->get_column_definition(to_bytes(_ttl_change->text()));
if (!cdef) {
throw exceptions::invalid_request_exception(fmt::format("Column '{}' does not exist in table {}.{}", _ttl_change->text(), keyspace(), column_family()));
}
if (cdef->type != timestamp_type && cdef->type != long_type && cdef->type != int32_type) {
throw exceptions::invalid_request_exception(fmt::format("TTL column {} must be of type timestamp, bigint or int, can't be {}", _ttl_change->text(), cdef->type->as_cql3_type().to_string()));
}
if (cdef->is_primary_key()) {
throw exceptions::invalid_request_exception(fmt::format("Cannot use a primary key column {} as a TTL column", _ttl_change->text()));
}
if (cdef->is_static()) {
throw exceptions::invalid_request_exception(fmt::format("Cannot use a static column {} as a TTL column", _ttl_change->text()));
}
std::optional<std::string> old_ttl_column = db::find_tag(*s, TTL_TAG_KEY);
if (old_ttl_column) {
throw exceptions::invalid_request_exception(fmt::format("Cannot set TTL column, table {}.{} already has a TTL column defined: {}", keyspace(), column_family(), *old_ttl_column));
}
const std::map<sstring, sstring>* old_tags_ptr = db::get_tags_of_table(s);
std::map<sstring, sstring> tags_map;
if (old_tags_ptr) {
// tags_ptr is a constant pointer to schema data. To modify
// it, we must make a copy.
tags_map = *old_tags_ptr;
}
tags_map[TTL_TAG_KEY] = _ttl_change->text();
cfm.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(std::move(tags_map)));
} else {
// Disable per-row TTL
const std::map<sstring, sstring>* tags_ptr = db::get_tags_of_table(s);
if (!tags_ptr || tags_ptr->find(TTL_TAG_KEY) == tags_ptr->end()) {
throw exceptions::invalid_request_exception(fmt::format("Cannot unset TTL column, table {}.{} does not have a TTL column set", keyspace(), column_family()));
}
// tags_ptr is a constant pointer to schema data. To modify it, we
// must make a copy.
std::map<sstring, sstring> tags_map = *tags_ptr;
tags_map.erase(TTL_TAG_KEY);
cfm.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(std::move(tags_map)));
}
break;
}
return make_pair(cfm.build(), std::move(view_updates));
}
@@ -577,15 +508,13 @@ alter_table_statement::raw_statement::raw_statement(cf_name name,
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
renames_type renames,
std::unique_ptr<attributes::raw> attrs,
shared_ptr<column_identifier::raw> ttl_change)
std::unique_ptr<attributes::raw> attrs)
: cf_statement(std::move(name))
, _type(t)
, _column_changes(std::move(column_changes))
, _properties(std::move(properties))
, _renames(std::move(renames))
, _attrs(std::move(attrs))
, _ttl_change(std::move(ttl_change))
{}
std::unique_ptr<cql3::statements::prepared_statement>
@@ -610,8 +539,7 @@ alter_table_statement::raw_statement::prepare(data_dictionary::database db, cql_
_column_changes,
_properties,
_renames,
std::move(prepared_attrs),
_ttl_change
std::move(prepared_attrs)
),
ctx,
// since alter table is `cql_statement_no_metadata` (it doesn't return any metadata when preparing)

View File

@@ -32,7 +32,6 @@ public:
drop,
opts,
rename,
ttl,
};
using renames_type = std::vector<std::pair<shared_ptr<column_identifier::raw>,
shared_ptr<column_identifier::raw>>>;
@@ -51,7 +50,6 @@ private:
const std::optional<cf_prop_defs> _properties;
const renames_type _renames;
const std::unique_ptr<attributes> _attrs;
shared_ptr<column_identifier::raw> _ttl_change;
public:
alter_table_statement(uint32_t bound_terms,
cf_name name,
@@ -59,8 +57,7 @@ public:
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
renames_type renames,
std::unique_ptr<attributes> attrs,
shared_ptr<column_identifier::raw> ttl_change);
std::unique_ptr<attributes> attrs);
virtual uint32_t get_bound_terms() const override;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
@@ -81,7 +78,6 @@ class alter_table_statement::raw_statement : public raw::cf_statement {
const std::optional<cf_prop_defs> _properties;
const alter_table_statement::renames_type _renames;
const std::unique_ptr<attributes::raw> _attrs;
shared_ptr<column_identifier::raw> _ttl_change;
public:
raw_statement(cf_name name,
@@ -89,8 +85,7 @@ public:
std::vector<column_change> column_changes,
std::optional<cf_prop_defs> properties,
renames_type renames,
std::unique_ptr<attributes::raw> attrs,
shared_ptr<column_identifier::raw> ttl_change);
std::unique_ptr<attributes::raw> attrs);
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;

View File

@@ -30,9 +30,6 @@
#include "service/storage_proxy.hh"
#include "db/config.hh"
#include "compaction/time_window_compaction_strategy.hh"
#include "db/tags/extension.hh"
#include "db/tags/utils.hh"
#include "alternator/ttl_tag.hh"
namespace cql3 {
@@ -44,12 +41,10 @@ create_table_statement::create_table_statement(cf_name name,
::shared_ptr<cf_prop_defs> properties,
bool if_not_exists,
column_set_type static_columns,
::shared_ptr<column_identifier> ttl_column,
const std::optional<table_id>& id)
: schema_altering_statement{name}
, _use_compact_storage(false)
, _static_columns{static_columns}
, _ttl_column{ttl_column}
, _properties{properties}
, _if_not_exists{if_not_exists}
, _id(id)
@@ -128,13 +123,6 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
#endif
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace(), true);
// Remembering which column was designated as the TTL column for row-based
// TTL column is done using a "tag" extension. If there is no TTL column,
// we don't need this extension at all.
if (_ttl_column) {
std::map<sstring, sstring> tags_map = {{TTL_TAG_KEY, _ttl_column->text()}};
builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(std::move(tags_map)));
}
}
void create_table_statement::add_column_metadata_from_aliases(schema_builder& builder, std::vector<bytes> aliases, const std::vector<data_type>& types, column_kind kind) const
@@ -210,7 +198,7 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
}
const bool has_default_ttl = _properties.properties()->get_default_time_to_live() > 0;
auto stmt = ::make_shared<create_table_statement>(*_cf_name, _properties.properties(), _if_not_exists, _static_columns, _ttl_column, _properties.properties()->get_id());
auto stmt = ::make_shared<create_table_statement>(*_cf_name, _properties.properties(), _if_not_exists, _static_columns, _properties.properties()->get_id());
bool ks_uses_tablets;
try {
@@ -415,27 +403,6 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
}
}
// If a TTL column is defined, it must be a regular column - not a static
// column or part of the primary key.
if (_ttl_column) {
if (!db.features().cql_row_ttl) {
throw exceptions::invalid_request_exception("The CQL per-row TTL feature is not yet supported by this cluster. Upgrade all nodes to use it.");
}
for (const auto& alias : key_aliases) {
if (alias->text() == _ttl_column->text()) {
throw exceptions::invalid_request_exception(format("TTL column {} cannot be part of the PRIMARY KEY", alias->text()));
}
}
for (const auto& alias : _column_aliases) {
if (alias->text() == _ttl_column->text()) {
throw exceptions::invalid_request_exception(format("TTL column {} cannot be part of the PRIMARY KEY", alias->text()));
}
}
if (_static_columns.contains(_ttl_column)) {
throw exceptions::invalid_request_exception(format("TTL column {} cannot be a static column", _ttl_column->text()));
}
}
return std::make_unique<prepared_statement>(audit_info(), stmt, std::move(stmt_warnings));
}
@@ -458,23 +425,12 @@ data_type create_table_statement::raw_statement::get_type_and_remove(column_map_
return _properties.get_reversable_type(*t, type);
}
void create_table_statement::raw_statement::add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static, bool is_ttl) {
void create_table_statement::raw_statement::add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static) {
_defined_names.emplace(def);
_definitions.emplace(def, type);
if (is_static) {
_static_columns.emplace(def);
}
if (is_ttl) {
if (_ttl_column) {
throw exceptions::invalid_request_exception(fmt::format("Cannot have more than one TTL column in a table. Saw {} and {}", _ttl_column->text(), def->text()));
}
// FIXME: find a way to check cql3_type::raw without fmt::format
auto type_name = fmt::format("{}", type);
if (type_name != "timestamp" && type_name != "bigint" && type_name != "int") {
throw exceptions::invalid_request_exception(fmt::format("TTL column '{}' must be of type timestamp, bigint or int, can't be {}", def->text(), type_name));
}
_ttl_column = def;
}
}
void create_table_statement::raw_statement::add_key_aliases(const std::vector<::shared_ptr<column_identifier>> aliases) {

View File

@@ -57,7 +57,6 @@ class create_table_statement : public schema_altering_statement {
shared_ptr_equal_by_value<column_identifier>>;
column_map_type _columns;
column_set_type _static_columns;
::shared_ptr<column_identifier> _ttl_column; // for row-based TTL
const ::shared_ptr<cf_prop_defs> _properties;
const bool _if_not_exists;
std::optional<table_id> _id;
@@ -66,7 +65,6 @@ public:
::shared_ptr<cf_prop_defs> properties,
bool if_not_exists,
column_set_type static_columns,
::shared_ptr<column_identifier> ttl_column,
const std::optional<table_id>& id);
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
@@ -102,7 +100,6 @@ private:
std::vector<std::vector<::shared_ptr<column_identifier>>> _key_aliases;
std::vector<::shared_ptr<column_identifier>> _column_aliases;
create_table_statement::column_set_type _static_columns;
::shared_ptr<column_identifier> _ttl_column; // for row-based TTL
std::multiset<::shared_ptr<column_identifier>,
indirect_less<::shared_ptr<column_identifier>, column_identifier::text_comparator>> _defined_names;
@@ -119,7 +116,7 @@ public:
data_type get_type_and_remove(column_map_type& columns, ::shared_ptr<column_identifier> t);
void add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static, bool is_ttl);
void add_definition(::shared_ptr<column_identifier> def, ::shared_ptr<cql3_type::raw> type, bool is_static);
void add_key_aliases(const std::vector<::shared_ptr<column_identifier>> aliases);

View File

@@ -50,8 +50,8 @@ public:
protected:
virtual audit::statement_category category() const override;
virtual audit::audit_info_ptr audit_info() const override {
constexpr bool batch = true;
return audit::audit::create_audit_info(category(), sstring(), sstring(), batch);
// We don't audit batch statements. Instead we audit statements that are inside the batch.
return audit::audit::create_no_audit_info();
}
};

View File

@@ -259,9 +259,11 @@ uint32_t select_statement::get_bound_terms() const {
future<> select_statement::check_access(query_processor& qp, const service::client_state& state) const {
try {
auto cdc = qp.db().get_cdc_base_table(*_schema);
auto& cf_name = _schema->is_view()
? _schema->view_info()->base_name()
const data_dictionary::database db = qp.db();
auto&& s = db.find_schema(keyspace(), column_family());
auto cdc = db.get_cdc_base_table(*s);
auto& cf_name = s->is_view()
? s->view_info()->base_name()
: (cdc ? cdc->cf_name() : column_family());
const schema_ptr& base_schema = cdc ? cdc : _schema;
bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*base_schema);
@@ -2004,7 +2006,9 @@ static std::optional<ann_ordering_info> get_ann_ordering_info(
auto indexes = sim.list_indexes();
auto it = std::find_if(indexes.begin(), indexes.end(), [&prepared_ann_ordering](const auto& ind) {
return secondary_index::vector_index::is_vector_index_on_column(ind.metadata(), prepared_ann_ordering.first->name_as_text());
return (ind.metadata().options().contains(db::index::secondary_index::custom_class_option_name) &&
ind.metadata().options().at(db::index::secondary_index::custom_class_option_name) == ANN_CUSTOM_INDEX_OPTION) &&
(ind.target_column() == prepared_ann_ordering.first->name_as_text());
});
if (it == indexes.end()) {

View File

@@ -621,6 +621,25 @@ db::config::config(std::shared_ptr<db::extensions> exts)
* @GroupDescription: Provides an overview of the group.
*/
/**
* @Group Ungrouped properties
*/
, background_writer_scheduling_quota(this, "background_writer_scheduling_quota", value_status::Deprecated, 1.0,
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
, auto_adjust_flush_quota(this, "auto_adjust_flush_quota", value_status::Deprecated, false,
"true: auto-adjust memtable shares for flush processes")
, memtable_flush_static_shares(this, "memtable_flush_static_shares", liveness::LiveUpdate, value_status::Used, 0,
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_static_shares(this, "compaction_static_shares", liveness::LiveUpdate, value_status::Used, 0,
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_max_shares(this, "compaction_max_shares", liveness::LiveUpdate, value_status::Used, default_compaction_maximum_shares,
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_enforce_min_threshold(this, "compaction_enforce_min_threshold", liveness::LiveUpdate, value_status::Used, false,
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
, compaction_flush_all_tables_before_major_seconds(this, "compaction_flush_all_tables_before_major_seconds", value_status::Used, 86400,
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400)."
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments."
"Set to 0 to disable automatic flushing all tables before major compaction.")
/**
* @Group Initialization properties
* @GroupDescription The minimal properties needed for configuring a cluster.
*/
@@ -1201,13 +1220,13 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"* org.apache.cassandra.auth.CassandraRoleManager: Stores role data in the system_auth keyspace;\n"
"* com.scylladb.auth.LDAPRoleManager: Fetches role data from an LDAP server.")
, permissions_validity_in_ms(this, "permissions_validity_in_ms", liveness::LiveUpdate, value_status::Used, 10000,
"How long authorized statements cache entries remain valid. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"How long permissions in cache remain valid. Depending on the authorizer, such as CassandraAuthorizer, fetching permissions can be resource intensive. Permissions caching is disabled when this property is set to 0 or when AllowAllAuthorizer is used. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"and the cached value has been read at least once during the permissions_validity_in_ms time frame. If any of these two conditions doesn't hold the cached value is going to be evicted from the cache.\n"
"\n"
"Related information: Object permissions")
, permissions_update_interval_in_ms(this, "permissions_update_interval_in_ms", liveness::LiveUpdate, value_status::Used, 2000,
"Refresh interval for authorized statements cache. After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms. This option additionally controls the permissions refresh interval for LDAP.")
, permissions_cache_max_entries(this, "permissions_cache_max_entries", liveness::LiveUpdate, value_status::Unused, 1000,
"Refresh interval for permissions cache (if enabled). After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms.")
, permissions_cache_max_entries(this, "permissions_cache_max_entries", liveness::LiveUpdate, value_status::Used, 1000,
"Maximum cached permission entries. Must have a non-zero value if permissions caching is enabled (see a permissions_validity_in_ms description).")
, server_encryption_options(this, "server_encryption_options", value_status::Used, {/*none*/},
"Enable or disable inter-node encryption. You must also generate keys and provide the appropriate key and trust store locations and passwords. The available options are:\n"
@@ -1292,7 +1311,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, fd_initial_value_ms(this, "fd_initial_value_ms", value_status::Used, 2 * 1000, "The initial failure_detector interval time in milliseconds.")
, shutdown_announce_in_ms(this, "shutdown_announce_in_ms", value_status::Used, 2 * 1000, "Time a node waits after sending gossip shutdown message in milliseconds. Same as -Dcassandra.shutdown_announce_in_ms in cassandra.")
, developer_mode(this, "developer_mode", value_status::Used, DEVELOPER_MODE_DEFAULT, "Relax environment checks. Setting to true can reduce performance and reliability significantly.")
, skip_wait_for_gossip_to_settle(this, "skip_wait_for_gossip_to_settle", value_status::Deprecated, -1, "An integer to configure the wait for gossip to settle. -1: wait normally, 0: do not wait at all, n: wait for at most n polls. Same as -Dcassandra.skip_wait_for_gossip_to_settle in cassandra.")
, skip_wait_for_gossip_to_settle(this, "skip_wait_for_gossip_to_settle", value_status::Used, -1, "An integer to configure the wait for gossip to settle. -1: wait normally, 0: do not wait at all, n: wait for at most n polls. Same as -Dcassandra.skip_wait_for_gossip_to_settle in cassandra.")
, force_gossip_generation(this, "force_gossip_generation", liveness::LiveUpdate, value_status::Used, -1 , "Force gossip to use the generation number provided by user.")
, experimental_features(this, "experimental_features", value_status::Used, {}, experimental_features_help_string())
, lsa_reclamation_step(this, "lsa_reclamation_step", value_status::Used, 1, "Minimum number of segments to reclaim in a single step.")
@@ -1375,10 +1394,6 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Start killing reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, reader_concurrency_semaphore_cpu_concurrency(this, "reader_concurrency_semaphore_cpu_concurrency", liveness::LiveUpdate, value_status::Used, 2,
"Admit new reads while there are less than this number of requests that need CPU.")
, reader_concurrency_semaphore_preemptive_abort_factor(this, "reader_concurrency_semaphore_preemptive_abort_factor", liveness::LiveUpdate, value_status::Used, 0.3,
"Admit new reads while their remaining time is more than this factor times their timeout times when arrived to a semaphore. Its vale means\n"
"* <= 0.0 means new reads will never get rejected during admission\n"
"* >= 1.0 means new reads will always get rejected during admission\n")
, view_update_reader_concurrency_semaphore_serialize_limit_multiplier(this, "view_update_reader_concurrency_semaphore_serialize_limit_multiplier", liveness::LiveUpdate, value_status::Used, 2,
"Start serializing view update reads after their collective memory consumption goes above $normal_limit * $multiplier.")
, view_update_reader_concurrency_semaphore_kill_limit_multiplier(this, "view_update_reader_concurrency_semaphore_kill_limit_multiplier", liveness::LiveUpdate, value_status::Used, 4,
@@ -1498,7 +1513,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, index_cache_fraction(this, "index_cache_fraction", liveness::LiveUpdate, value_status::Used, 0.2,
"The maximum fraction of cache memory permitted for use by index cache. Clamped to the [0.0; 1.0] range. Must be small enough to not deprive the row cache of memory, but should be big enough to fit a large fraction of the index. The default value 0.2 means that at least 80\% of cache memory is reserved for the row cache, while at most 20\% is usable by the index cache.")
, consistent_cluster_management(this, "consistent_cluster_management", value_status::Deprecated, true, "Use RAFT for cluster management and DDL.")
, force_gossip_topology_changes(this, "force_gossip_topology_changes", value_status::Deprecated, false, "Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
, force_gossip_topology_changes(this, "force_gossip_topology_changes", value_status::Used, false, "Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
, recovery_leader(this, "recovery_leader", liveness::LiveUpdate, value_status::Used, utils::null_uuid(), "Host ID of the node restarted first while performing the Manual Raft-based Recovery Procedure. Warning: this option disables some guardrails for the needs of the Manual Raft-based Recovery Procedure. Make sure you unset it at the end of the procedure.")
, wasm_cache_memory_fraction(this, "wasm_cache_memory_fraction", value_status::Used, 0.01, "Maximum total size of all WASM instances stored in the cache as fraction of total shard memory.")
, wasm_cache_timeout_in_ms(this, "wasm_cache_timeout_in_ms", value_status::Used, 5000, "Time after which an instance is evicted from the cache.")
@@ -1527,21 +1542,17 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Allows target tablet size to be configured. Defaults to 5G (in bytes). Maintaining tablets at reasonable sizes is important to be able to " \
"redistribute load. A higher value means tablet migration throughput can be reduced. A lower value may cause number of tablets to increase significantly, " \
"potentially resulting in performance drawbacks.")
, tablet_streaming_read_concurrency_per_shard(this, "tablet_streaming_read_concurrency_per_shard", liveness::LiveUpdate, value_status::Used, 2,
"Maximum number of tablets which may be leaving a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
, tablet_streaming_write_concurrency_per_shard(this, "tablet_streaming_write_concurrency_per_shard", liveness::LiveUpdate, value_status::Used, 2,
"Maximum number of tablets which may be pending on a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
, replication_strategy_warn_list(this, "replication_strategy_warn_list", liveness::LiveUpdate, value_status::Used, {locator::replication_strategy_type::simple}, "Controls which replication strategies to warn about when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
, replication_strategy_fail_list(this, "replication_strategy_fail_list", liveness::LiveUpdate, value_status::Used, {}, "Controls which replication strategies are disallowed to be used when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
, service_levels_interval(this, "service_levels_interval_ms", liveness::LiveUpdate, value_status::Used, 10000, "Controls how often service levels module polls configuration table")
, audit(this, "audit", value_status::Used, "table",
, audit(this, "audit", value_status::Used, "none",
"Controls the audit feature:\n"
"\n"
"\tnone : No auditing enabled.\n"
"\tsyslog : Audit messages sent to Syslog.\n"
"\ttable : Audit messages written to column family named audit.audit_log.\n")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,DDL,AUTH", "Comma separated list of operation categories that should be audited.")
, audit_tables(this, "audit_tables", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of table names (<keyspace>.<table>) that will be audited.")
, audit_keyspaces(this, "audit_keyspaces", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of keyspaces that will be audited. All tables in those keyspaces will be audited")
, audit_unix_socket_path(this, "audit_unix_socket_path", value_status::Used, "/dev/log", "The path to the unix socket used for writing to syslog. Only applicable when audit is set to syslog.")
@@ -1591,25 +1602,6 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Sets the maximum difference in percentages between the most loaded and least loaded nodes, below which the load balancer considers nodes balanced.")
, minimal_tablet_size_for_balancing(this, "minimal_tablet_size_for_balancing", liveness::LiveUpdate, value_status::Used, service::default_target_tablet_size / 100,
"Sets the minimal tablet size for the load balancer. For any tablet smaller than this, the balancer will use this size instead of the actual tablet size.")
/**
* @Group Ungrouped properties
*/
, background_writer_scheduling_quota(this, "background_writer_scheduling_quota", value_status::Deprecated, 1.0,
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
, auto_adjust_flush_quota(this, "auto_adjust_flush_quota", value_status::Deprecated, false,
"true: auto-adjust memtable shares for flush processes")
, memtable_flush_static_shares(this, "memtable_flush_static_shares", liveness::LiveUpdate, value_status::Used, 0,
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_static_shares(this, "compaction_static_shares", liveness::LiveUpdate, value_status::Used, 0,
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_max_shares(this, "compaction_max_shares", liveness::LiveUpdate, value_status::Used, default_compaction_maximum_shares,
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
, compaction_enforce_min_threshold(this, "compaction_enforce_min_threshold", liveness::LiveUpdate, value_status::Used, false,
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
, compaction_flush_all_tables_before_major_seconds(this, "compaction_flush_all_tables_before_major_seconds", value_status::Used, 86400,
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400)."
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments."
"Set to 0 to disable automatic flushing all tables before major compaction.")
, default_log_level(this, "default_log_level", value_status::Used, seastar::log_level::info, "Default log level for log messages")
, logger_log_level(this, "logger_log_level", value_status::Used, {}, "Map of logger name to log level. Valid log levels are 'error', 'warn', 'info', 'debug' and 'trace'")
, log_to_stdout(this, "log_to_stdout", value_status::Used, true, "Send log output to stdout")

View File

@@ -185,6 +185,13 @@ public:
* All values and documentation taken from
* http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html
*/
named_value<double> background_writer_scheduling_quota;
named_value<bool> auto_adjust_flush_quota;
named_value<float> memtable_flush_static_shares;
named_value<float> compaction_static_shares;
named_value<float> compaction_max_shares;
named_value<bool> compaction_enforce_min_threshold;
named_value<uint32_t> compaction_flush_all_tables_before_major_seconds;
named_value<sstring> cluster_name;
named_value<sstring> listen_address;
named_value<sstring> listen_interface;
@@ -439,7 +446,6 @@ public:
named_value<uint32_t> reader_concurrency_semaphore_serialize_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_kill_limit_multiplier;
named_value<uint32_t> reader_concurrency_semaphore_cpu_concurrency;
named_value<float> reader_concurrency_semaphore_preemptive_abort_factor;
named_value<uint32_t> view_update_reader_concurrency_semaphore_serialize_limit_multiplier;
named_value<uint32_t> view_update_reader_concurrency_semaphore_kill_limit_multiplier;
named_value<uint32_t> view_update_reader_concurrency_semaphore_cpu_concurrency;
@@ -542,8 +548,6 @@ public:
named_value<double> tablets_initial_scale_factor;
named_value<unsigned> tablets_per_shard_goal;
named_value<uint64_t> target_tablet_size_in_bytes;
named_value<unsigned> tablet_streaming_read_concurrency_per_shard;
named_value<unsigned> tablet_streaming_write_concurrency_per_shard;
named_value<std::vector<enum_option<replication_strategy_restriction_t>>> replication_strategy_warn_list;
named_value<std::vector<enum_option<replication_strategy_restriction_t>>> replication_strategy_fail_list;
@@ -608,14 +612,6 @@ public:
named_value<float> size_based_balance_threshold_percentage;
named_value<uint64_t> minimal_tablet_size_for_balancing;
named_value<double> background_writer_scheduling_quota;
named_value<bool> auto_adjust_flush_quota;
named_value<float> memtable_flush_static_shares;
named_value<float> compaction_static_shares;
named_value<float> compaction_max_shares;
named_value<bool> compaction_enforce_min_threshold;
named_value<uint32_t> compaction_flush_all_tables_before_major_seconds;
static const sstring default_tls_priority;
private:
template<typename T>

View File

@@ -158,7 +158,7 @@ void hint_endpoint_manager::cancel_draining() noexcept {
_sender.cancel_draining();
}
hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hint_directory, manager& shard_manager, scheduling_group send_sg)
hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hint_directory, manager& shard_manager)
: _key(key)
, _shard_manager(shard_manager)
, _store_gate("hint_endpoint_manager")
@@ -169,7 +169,7 @@ hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hi
// Approximate the position of the last written hint by using the same formula as for segment id calculation in commitlog
// TODO: Should this logic be deduplicated with what is in the commitlog?
, _last_written_rp(this_shard_id(), std::chrono::duration_cast<std::chrono::milliseconds>(runtime::get_boot_time().time_since_epoch()).count())
, _sender(*this, _shard_manager.local_storage_proxy(), _shard_manager.local_db(), _shard_manager.local_gossiper(), send_sg)
, _sender(*this, _shard_manager.local_storage_proxy(), _shard_manager.local_db(), _shard_manager.local_gossiper())
{}
hint_endpoint_manager::hint_endpoint_manager(hint_endpoint_manager&& other)

View File

@@ -63,7 +63,7 @@ private:
hint_sender _sender;
public:
hint_endpoint_manager(const endpoint_id& key, std::filesystem::path hint_directory, manager& shard_manager, scheduling_group send_sg);
hint_endpoint_manager(const endpoint_id& key, std::filesystem::path hint_directory, manager& shard_manager);
hint_endpoint_manager(hint_endpoint_manager&&);
~hint_endpoint_manager();

View File

@@ -122,7 +122,7 @@ const column_mapping& hint_sender::get_column_mapping(lw_shared_ptr<send_one_fil
return cm_it->second;
}
hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy,replica::database& local_db, const gms::gossiper& local_gossiper, scheduling_group sg) noexcept
hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy,replica::database& local_db, const gms::gossiper& local_gossiper) noexcept
: _stopped(make_ready_future<>())
, _ep_key(parent.end_point_key())
, _ep_manager(parent)
@@ -130,7 +130,7 @@ hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy&
, _resource_manager(_shard_manager._resource_manager)
, _proxy(local_storage_proxy)
, _db(local_db)
, _hints_cpu_sched_group(sg)
, _hints_cpu_sched_group(_db.get_streaming_scheduling_group())
, _gossiper(local_gossiper)
, _file_update_mutex(_ep_manager.file_update_mutex())
{}

View File

@@ -120,7 +120,7 @@ private:
std::multimap<db::replay_position, lw_shared_ptr<std::optional<promise<>>>> _replay_waiters;
public:
hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy, replica::database& local_db, const gms::gossiper& local_gossiper, scheduling_group sg) noexcept;
hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy, replica::database& local_db, const gms::gossiper& local_gossiper) noexcept;
~hint_sender();
/// \brief A constructor that should be called from the copy/move-constructor of hint_endpoint_manager.

View File

@@ -142,7 +142,7 @@ future<> directory_initializer::ensure_rebalanced() {
}
manager::manager(service::storage_proxy& proxy, sstring hints_directory, host_filter filter, int64_t max_hint_window_ms,
resource_manager& res_manager, sharded<replica::database>& db, scheduling_group sg)
resource_manager& res_manager, sharded<replica::database>& db)
: _hints_dir(fs::path(hints_directory) / fmt::to_string(this_shard_id()))
, _host_filter(std::move(filter))
, _proxy(proxy)
@@ -150,7 +150,6 @@ manager::manager(service::storage_proxy& proxy, sstring hints_directory, host_fi
, _local_db(db.local())
, _draining_eps_gate(seastar::format("hints::manager::{}", _hints_dir.native()))
, _resource_manager(res_manager)
, _hints_sending_sched_group(sg)
{
if (utils::get_local_injector().enter("decrease_hints_flush_period")) {
hints_flush_period = std::chrono::seconds{1};
@@ -416,7 +415,7 @@ hint_endpoint_manager& manager::get_ep_manager(const endpoint_id& host_id, const
try {
std::filesystem::path hint_directory = hints_dir() / (_uses_host_id ? fmt::to_string(host_id) : fmt::to_string(ip));
auto [it, _] = _ep_managers.emplace(host_id, hint_endpoint_manager{host_id, std::move(hint_directory), *this, _hints_sending_sched_group});
auto [it, _] = _ep_managers.emplace(host_id, hint_endpoint_manager{host_id, std::move(hint_directory), *this});
hint_endpoint_manager& ep_man = it->second;
manager_logger.trace("Created an endpoint manager for {}", host_id);

View File

@@ -133,7 +133,6 @@ private:
hint_stats _stats;
seastar::metrics::metric_groups _metrics;
scheduling_group _hints_sending_sched_group;
// We need to keep a variant here. Before migrating hinted handoff to using host ID, hint directories will
// still represent IP addresses. But after the migration, they will start representing host IDs.
@@ -156,7 +155,7 @@ private:
public:
manager(service::storage_proxy& proxy, sstring hints_directory, host_filter filter,
int64_t max_hint_window_ms, resource_manager& res_manager, sharded<replica::database>& db, scheduling_group sg);
int64_t max_hint_window_ms, resource_manager& res_manager, sharded<replica::database>& db);
manager(const manager&) = delete;
manager& operator=(const manager&) = delete;

View File

@@ -16,7 +16,6 @@
#include <string>
#include <tuple>
#include "cql3/cql3_type.hh"
#include "types/user.hh"
#include "types/map.hh"
#include "types/list.hh"
@@ -114,7 +113,7 @@ std::vector<data_type> type_parser::get_type_parameters(bool multicell)
throw parse_exception(_str, _idx, "unexpected end of string");
}
std::tuple<data_type, vector_dimension_t> type_parser::get_vector_parameters()
std::tuple<data_type, size_t> type_parser::get_vector_parameters()
{
if (is_eos() || _str[_idx] != '(') {
throw std::logic_error("internal error");
@@ -129,7 +128,7 @@ std::tuple<data_type, vector_dimension_t> type_parser::get_vector_parameters()
}
data_type type = do_parse(true);
vector_dimension_t size = 0;
size_t size = 0;
if (_str[_idx] == ',') {
++_idx;
skip_blank();
@@ -143,20 +142,7 @@ std::tuple<data_type, vector_dimension_t> type_parser::get_vector_parameters()
throw parse_exception(_str, _idx, "expected digit or ')'");
}
unsigned long parsed_size;
try {
parsed_size = std::stoul(_str.substr(i, _idx - i));
} catch (const std::exception& e) {
throw parse_exception(_str, i, format("Invalid vector dimension: {}", e.what()));
}
static_assert(sizeof(unsigned long) >= sizeof(vector_dimension_t));
if (parsed_size == 0) {
throw parse_exception(_str, _idx, "Vectors must have a dimension greater than 0");
}
if (parsed_size > cql3::cql3_type::MAX_VECTOR_DIMENSION) {
throw parse_exception(_str, _idx, format("Vectors must have a dimension less than or equal to {}", cql3::cql3_type::MAX_VECTOR_DIMENSION));
}
size = static_cast<vector_dimension_t>(parsed_size);
size = std::stoul(_str.substr(i, _idx - i));
++_idx; // skipping ')'
return std::make_tuple(type, size);

View File

@@ -97,7 +97,7 @@ public:
}
#endif
std::vector<data_type> get_type_parameters(bool multicell=true);
std::tuple<data_type, vector_dimension_t> get_vector_parameters();
std::tuple<data_type, size_t> get_vector_parameters();
std::tuple<sstring, bytes, std::vector<bytes>, std::vector<data_type>> get_user_type_parameters();
data_type do_parse(bool multicell = true);

View File

@@ -24,11 +24,12 @@
#include "readers/forwardable.hh"
#include "readers/nonforwardable.hh"
#include "cache_mutation_reader.hh"
#include "replica/partition_snapshot_reader.hh"
#include "partition_snapshot_reader.hh"
#include "keys/clustering_key_filter.hh"
#include "utils/assert.hh"
#include "utils/updateable_value.hh"
#include "utils/labels.hh"
#include "utils/chunked_vector.hh"
namespace cache {
@@ -845,7 +846,7 @@ mutation_reader row_cache::make_nonpopulating_reader(schema_ptr schema, reader_p
cache_entry& e = *i;
upgrade_entry(e);
tracing::trace(ts, "Reading partition {} from cache", pos);
return replica::make_partition_snapshot_reader<false, dummy_accounter>(
return make_partition_snapshot_flat_reader<false, dummy_accounter>(
schema,
std::move(permit),
e.key(),
@@ -1215,10 +1216,10 @@ future<> row_cache::invalidate(external_updater eu, const dht::decorated_key& dk
}
future<> row_cache::invalidate(external_updater eu, const dht::partition_range& range, cache_invalidation_filter filter) {
return invalidate(std::move(eu), dht::partition_range_vector({range}), std::move(filter));
return invalidate(std::move(eu), utils::chunked_vector<dht::partition_range>({range}), std::move(filter));
}
future<> row_cache::invalidate(external_updater eu, dht::partition_range_vector&& ranges, cache_invalidation_filter filter) {
future<> row_cache::invalidate(external_updater eu, utils::chunked_vector<dht::partition_range>&& ranges, cache_invalidation_filter filter) {
return do_update(std::move(eu), [this, ranges = std::move(ranges), filter = std::move(filter)] mutable {
return seastar::async([this, ranges = std::move(ranges), filter = std::move(filter)] {
auto on_failure = defer([this] () noexcept {

View File

@@ -17,6 +17,7 @@
#include "utils/histogram.hh"
#include "mutation/partition_version.hh"
#include "utils/double-decker.hh"
#include "utils/chunked_vector.hh"
#include "db/cache_tracker.hh"
#include "readers/empty.hh"
#include "readers/mutation_source.hh"
@@ -457,7 +458,7 @@ public:
// mutation source made prior to the call to invalidate().
future<> invalidate(external_updater, const dht::decorated_key&);
future<> invalidate(external_updater, const dht::partition_range& = query::full_partition_range, cache_invalidation_filter filter = [] (const auto&) { return true; });
future<> invalidate(external_updater, dht::partition_range_vector&&, cache_invalidation_filter filter = [] (const auto&) { return true; });
future<> invalidate(external_updater, utils::chunked_vector<dht::partition_range>&&, cache_invalidation_filter filter = [] (const auto&) { return true; });
// Evicts entries from cache.
//

View File

@@ -1139,14 +1139,17 @@ future<> schema_applier::finalize_tables_and_views() {
// was already dropped (see https://github.com/scylladb/scylla/issues/5614)
for (auto& dropped_view : diff.tables_and_views.local().views.dropped) {
auto s = dropped_view.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}
for (auto& dropped_table : diff.tables_and_views.local().tables.dropped) {
auto s = dropped_table.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}
for (auto& dropped_cdc : diff.tables_and_views.local().cdc.dropped) {
auto s = dropped_cdc.get();
co_await _ss.local().on_cleanup_for_drop_table(s->id());
co_await replica::database::cleanup_drop_table_on_all_shards(sharded_db, _sys_ks, true, diff.table_shards[s->id()]);
}

View File

@@ -105,7 +105,7 @@ namespace {
schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
// all schema tables are group0 tables
builder.set_is_group0_table(true);
builder.set_is_group0_table();
}
});
}

View File

@@ -21,16 +21,14 @@
#include "replica/database.hh"
#include "replica/global_table_ptr.hh"
#include "sstables/sstables_manager.hh"
#include "service/storage_proxy.hh"
logging::logger snap_log("snapshots");
namespace db {
snapshot_ctl::snapshot_ctl(sharded<replica::database>& db, sharded<service::storage_proxy>& sp, tasks::task_manager& tm, sstables::storage_manager& sstm, config cfg)
snapshot_ctl::snapshot_ctl(sharded<replica::database>& db, tasks::task_manager& tm, sstables::storage_manager& sstm, config cfg)
: _config(std::move(cfg))
, _db(db)
, _sp(sp)
, _ops("snapshot_ctl")
, _task_manager_module(make_shared<snapshot::task_manager_module>(tm))
, _storage_manager(sstm)
@@ -106,45 +104,6 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
});
}
future<> snapshot_ctl::take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
if (tag.empty()) {
throw std::invalid_argument("You must supply a snapshot name.");
}
if (ks_names.size() != 1 && !tables.empty()) {
throw std::invalid_argument("Cannot name tables when doing multiple keyspaces snapshot");
}
if (ks_names.empty()) {
std::ranges::copy(_db.local().get_keyspaces() | std::views::keys, std::back_inserter(ks_names));
}
return run_snapshot_modify_operation([this, ks_names = std::move(ks_names), tables = std::move(tables), tag = std::move(tag), opts] () mutable {
return do_take_cluster_column_family_snapshot(std::move(ks_names), std::move(tables), std::move(tag), opts);
});
}
future<> snapshot_ctl::do_take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
if (tables.empty()) {
co_await coroutine::parallel_for_each(ks_names, [tag, this] (const auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
});
co_await _sp.local().snapshot_keyspace(
ks_names | std::views::transform([&](auto& ks) { return std::make_pair(ks, sstring{}); })
| std::ranges::to<std::unordered_multimap>(),
tag, opts
);
co_return;
};
auto ks = ks_names[0];
co_await check_snapshot_not_exist(ks, tag, tables);
co_await _sp.local().snapshot_keyspace(
tables | std::views::transform([&](auto& cf) { return std::make_pair(ks, cf); })
| std::ranges::to<std::unordered_multimap>(),
tag, opts
);
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
@@ -226,4 +185,4 @@ future<int64_t> snapshot_ctl::true_snapshots_size(sstring ks, sstring cf) {
}));
}
}
}

View File

@@ -24,7 +24,6 @@
using namespace seastar;
namespace sstables { class storage_manager; }
namespace service { class storage_proxy; }
namespace db {
@@ -64,7 +63,7 @@ public:
using db_snapshot_details = std::vector<table_snapshot_details_ext>;
snapshot_ctl(sharded<replica::database>& db, sharded<service::storage_proxy>&, tasks::task_manager& tm, sstables::storage_manager& sstm, config cfg);
snapshot_ctl(sharded<replica::database>& db, tasks::task_manager& tm, sstables::storage_manager& sstm, config cfg);
future<> stop();
@@ -96,17 +95,6 @@ public:
*/
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
/**
* Takes the snapshot of multiple tables or a whole keyspace, or all keyspaces,
* using global, clusterwide topology coordinated op.
* A snapshot name must be specified.
*
* @param ks_names the keyspaces to snapshot
* @param tables optional - a vector of tables names to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
/**
* Remove the snapshot with the given name from the given keyspaces.
* If no tag is specified we will remove all snapshots.
@@ -123,7 +111,6 @@ public:
private:
config _config;
sharded<replica::database>& _db;
sharded<service::storage_proxy>& _sp;
seastar::rwlock _lock;
seastar::named_gate _ops;
shared_ptr<snapshot::task_manager_module> _task_manager_module;
@@ -146,7 +133,6 @@ private:
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {} );
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
future<> do_take_cluster_column_family_snapshot(std::vector<sstring> ks_names, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
};
}
}

View File

@@ -770,6 +770,13 @@ system_distributed_keyspace::get_cdc_desc_v1_timestamps(context ctx) {
co_return res;
}
bool system_distributed_keyspace::workload_prioritization_tables_exists() {
auto wp_table = get_updated_service_levels(_qp.db(), true);
auto table = _qp.db().try_find_table(NAME, wp_table->cf_name());
return table && table->schema()->equal_columns(*wp_table);
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
}

View File

@@ -117,6 +117,7 @@ public:
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;
bool workload_prioritization_tables_exists();
private:
future<> create_tables(std::vector<schema_ptr> tables);

View File

@@ -87,31 +87,15 @@ namespace {
static const std::unordered_set<sstring> tables = {
schema_tables::SCYLLA_TABLE_SCHEMA_HISTORY,
system_keyspace::BROADCAST_KV_STORE,
system_keyspace::CDC_GENERATIONS_V3,
system_keyspace::RAFT,
system_keyspace::RAFT_SNAPSHOTS,
system_keyspace::RAFT_SNAPSHOT_CONFIG,
system_keyspace::GROUP0_HISTORY,
system_keyspace::DISCOVERY,
system_keyspace::TABLETS,
system_keyspace::TOPOLOGY,
system_keyspace::TOPOLOGY_REQUESTS,
system_keyspace::LOCAL,
system_keyspace::PEERS,
system_keyspace::SCYLLA_LOCAL,
system_keyspace::COMMITLOG_CLEANUPS,
system_keyspace::SERVICE_LEVELS_V2,
system_keyspace::VIEW_BUILD_STATUS_V2,
system_keyspace::CDC_STREAMS_STATE,
system_keyspace::CDC_STREAMS_HISTORY,
system_keyspace::ROLES,
system_keyspace::ROLE_MEMBERS,
system_keyspace::ROLE_ATTRIBUTES,
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::CDC_LOCAL,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
};
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.enable_schema_commitlog();
@@ -143,7 +127,7 @@ namespace {
system_keyspace::REPAIR_TASKS,
};
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_is_group0_table(true);
builder.set_is_group0_table();
}
});
}
@@ -335,10 +319,6 @@ schema_ptr system_keyspace::topology_requests() {
.with_column("truncate_table_id", uuid_type)
.with_column("new_keyspace_rf_change_ks_name", utf8_type)
.with_column("new_keyspace_rf_change_data", map_type_impl::get_instance(utf8_type, utf8_type, false))
.with_column("snapshot_table_ids", set_type_impl::get_instance(uuid_type, false))
.with_column("snapshot_tag", utf8_type)
.with_column("snapshot_expiry", timestamp_type)
.with_column("snapshot_skip_flush", boolean_type)
.set_comment("Topology request tracking")
.with_hash_version()
.build();
@@ -3585,18 +3565,6 @@ system_keyspace::topology_requests_entry system_keyspace::topology_request_row_t
entry.new_keyspace_rf_change_ks_name = row.get_as<sstring>("new_keyspace_rf_change_ks_name");
entry.new_keyspace_rf_change_data = row.get_map<sstring,sstring>("new_keyspace_rf_change_data");
}
if (row.has("snapshot_table_ids")) {
entry.snapshot_tag = row.get_as<sstring>("snapshot_tag");
entry.snapshot_skip_flush = row.get_as<bool>("snapshot_skip_flush");
entry.snapshot_table_ids = row.get_set<utils::UUID>("snapshot_table_ids")
| std::views::transform([](auto& uuid) { return table_id(uuid); })
| std::ranges::to<std::unordered_set>()
;
;
if (row.has("snapshot_expiry")) {
entry.snapshot_expiry = row.get_as<db_clock::time_point>("snapshot_expiry");
}
}
return entry;
}

View File

@@ -417,10 +417,6 @@ public:
std::optional<sstring> new_keyspace_rf_change_ks_name;
// The KS options to be used when executing the scheduled ALTER KS statement
std::optional<std::unordered_map<sstring, sstring>> new_keyspace_rf_change_data;
std::optional<std::unordered_set<table_id>> snapshot_table_ids;
std::optional<sstring> snapshot_tag;
std::optional<db_clock::time_point> snapshot_expiry;
bool snapshot_skip_flush;
};
using topology_requests_entries = std::unordered_map<utils::UUID, system_keyspace::topology_requests_entry>;

View File

@@ -23,7 +23,6 @@
#include <seastar/core/future-util.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/all.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <flat_map>
@@ -66,7 +65,6 @@
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/small_vector.hh"
#include "view_builder.hh"
#include "view_info.hh"
#include "view_update_checks.hh"
#include "types/list.hh"
@@ -932,8 +930,7 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
const row& existing_row = existing.cells();
const row& updated_row = update.cells();
const bool base_has_nonexpiring_marker = update.marker().is_live() && !update.marker().is_expiring();
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row, base_has_nonexpiring_marker] (const column_definition& cdef) {
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row] (const column_definition& cdef) {
const auto view_it = _view->columns_by_name().find(cdef.name());
const bool column_is_selected = view_it != _view->columns_by_name().end();
@@ -941,49 +938,29 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
// as part of its PK, there are NO virtual columns corresponding to the unselected columns in the view.
// Because of that, we don't generate view updates when the value in an unselected column is created
// or changes.
if (!column_is_selected && _base_info.has_base_non_pk_columns_in_view_pk) {
if (!column_is_selected) {
return true;
}
//TODO(sarna): Optimize collections case - currently they do not go under optimization
if (!cdef.is_atomic()) {
return false;
}
// We cannot skip if the value was created or deleted, unless we have a non-expiring marker
// We cannot skip if the value was created or deleted
const auto* existing_cell = existing_row.find_cell(cdef.id);
const auto* updated_cell = updated_row.find_cell(cdef.id);
if (existing_cell == nullptr || updated_cell == nullptr) {
return existing_cell == updated_cell || (!column_is_selected && base_has_nonexpiring_marker);
return existing_cell == updated_cell;
}
if (!cdef.is_atomic()) {
return existing_cell->as_collection_mutation().data == updated_cell->as_collection_mutation().data;
}
atomic_cell_view existing_cell_view = existing_cell->as_atomic_cell(cdef);
atomic_cell_view updated_cell_view = updated_cell->as_atomic_cell(cdef);
// We cannot skip when a selected column is changed
if (column_is_selected) {
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
// With non-expiring row marker, liveness checks below are not relevant
if (base_has_nonexpiring_marker) {
return true;
}
if (existing_cell_view.is_live() != updated_cell_view.is_live()) {
return false;
}
// We cannot skip if the change updates TTL
const bool existing_has_ttl = existing_cell_view.is_live_and_has_ttl();
const bool updated_has_ttl = updated_cell_view.is_live_and_has_ttl();
if (existing_has_ttl || updated_has_ttl) {
return existing_has_ttl == updated_has_ttl && existing_cell_view.expiry() == updated_cell_view.expiry();
}
return true;
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
});
}
@@ -1751,7 +1728,7 @@ static endpoints_to_update get_view_natural_endpoint_vnodes(
std::vector<std::reference_wrapper<const locator::node>> base_nodes,
std::vector<std::reference_wrapper<const locator::node>> view_nodes,
locator::endpoint_dc_rack my_location,
const locator::network_topology_strategy* network_topology,
const bool network_topology,
replica::cf_stats& cf_stats) {
using node_vector = std::vector<std::reference_wrapper<const locator::node>>;
node_vector base_endpoints, view_endpoints;
@@ -1904,7 +1881,7 @@ endpoints_to_update get_view_natural_endpoint(
locator::host_id me,
const locator::effective_replication_map_ptr& base_erm,
const locator::effective_replication_map_ptr& view_erm,
const locator::abstract_replication_strategy& replication_strategy,
const bool network_topology,
const dht::token& base_token,
const dht::token& view_token,
bool use_tablets,
@@ -1912,7 +1889,6 @@ endpoints_to_update get_view_natural_endpoint(
auto& topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
auto& my_location = topology.get_location(me);
auto* network_topology = dynamic_cast<const locator::network_topology_strategy*>(&replication_strategy);
auto resolve = [&] (const locator::topology& topology, const locator::host_id& ep, bool is_view) -> const locator::node& {
if (auto* np = topology.find_node(ep)) {
@@ -1946,7 +1922,7 @@ endpoints_to_update get_view_natural_endpoint(
// view pairing as the leaving base replica.
// note that the recursive call will not recurse again because leaving_base is in base_nodes.
auto leaving_base = it->get().host_id();
return get_view_natural_endpoint(leaving_base, base_erm, view_erm, replication_strategy, base_token,
return get_view_natural_endpoint(leaving_base, base_erm, view_erm, network_topology, base_token,
view_token, use_tablets, cf_stats);
}
}
@@ -2042,7 +2018,9 @@ future<> view_update_generator::mutate_MV(
wait_for_all_updates wait_for_all)
{
auto& ks = _db.find_keyspace(base->ks_name());
auto& replication = ks.get_replication_strategy();
const bool uses_tablets = ks.uses_tablets();
const bool uses_nts = dynamic_cast<const locator::network_topology_strategy*>(&ks.get_replication_strategy()) != nullptr;
// The object pointed by `ks` may disappear after preeemption. It should not be touched again after this comment.
std::unordered_map<table_id, locator::effective_replication_map_ptr> erms;
auto get_erm = [&] (table_id id) {
auto it = erms.find(id);
@@ -2061,8 +2039,8 @@ future<> view_update_generator::mutate_MV(
co_await max_concurrent_for_each(view_updates, max_concurrent_updates, [&] (frozen_mutation_and_schema mut) mutable -> future<> {
auto view_token = dht::get_token(*mut.s, mut.fm.key());
auto view_ermp = erms.at(mut.s->id());
auto [target_endpoint, no_pairing_endpoint] = get_view_natural_endpoint(me, base_ermp, view_ermp, replication, base_token, view_token,
ks.uses_tablets(), cf_stats);
auto [target_endpoint, no_pairing_endpoint] = get_view_natural_endpoint(me, base_ermp, view_ermp, uses_nts, base_token, view_token,
uses_tablets, cf_stats);
auto remote_endpoints = view_ermp->get_pending_replicas(view_token);
auto memory_units = seastar::make_lw_shared<db::timeout_semaphore_units>(pending_view_update_memory_units.split(memory_usage_of(mut)));
if (no_pairing_endpoint) {
@@ -2240,20 +2218,12 @@ void view_builder::setup_metrics() {
}
future<> view_builder::start_in_background(service::migration_manager& mm, utils::cross_shard_barrier barrier) {
auto step_fiber = make_ready_future<>();
try {
view_builder_init_state vbi;
auto fail = defer([&barrier] mutable { barrier.abort(); });
// Semaphore usage invariants:
// - One unit of _sem serializes all per-shard bookkeeping that mutates view-builder state
// (_base_to_build_step, _built_views, build_status, reader resets).
// - The unit is held for the whole operation, including the async chain, until the state
// is stable for the next operation on that shard.
// - Cross-shard operations acquire _sem on shard 0 for the duration of the broadcast.
// Other shards acquire their own _sem only around their local handling; shard 0 skips
// the local acquire because it already holds the unit from the dispatcher.
// Guard the whole startup routine with a semaphore so that it's not intercepted by
// `on_drop_view`, `on_create_view`, or `on_update_view` events.
// Guard the whole startup routine with a semaphore,
// so that it's not intercepted by `on_drop_view`, `on_create_view`
// or `on_update_view` events.
auto units = co_await get_units(_sem, view_builder_semaphore_units);
// Wait for schema agreement even if we're a seed node.
co_await mm.wait_for_schema_agreement(_db, db::timeout_clock::time_point::max(), &_as);
@@ -2274,10 +2244,8 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
_mnotifier.register_listener(this);
co_await calculate_shard_build_step(vbi);
_current_step = _base_to_build_step.begin();
// If preparation above fails, run_in_background() is not invoked, just
// the start_in_background() emits a warning into logs and resolves
step_fiber = run_in_background();
// Waited on indirectly in stop().
(void)_build_step.trigger();
} catch (...) {
auto ex = std::current_exception();
auto ll = log_level::error;
@@ -2292,12 +2260,10 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
}
vlogger.log(ll, "start aborted: {}", ex);
}
co_await std::move(step_fiber);
}
future<> view_builder::start(service::migration_manager& mm, utils::cross_shard_barrier barrier) {
_step_fiber = start_in_background(mm, std::move(barrier));
_started = start_in_background(mm, std::move(barrier));
return make_ready_future<>();
}
@@ -2307,13 +2273,12 @@ future<> view_builder::drain() {
}
vlogger.info("Draining view builder");
_as.request_abort();
co_await std::move(_started);
co_await _mnotifier.unregister_listener(this);
co_await _ops_gate.close();
co_await _vug.drain();
co_await _sem.wait();
_sem.broken();
_build_step.broken();
co_await std::move(_step_fiber);
co_await _build_step.join();
co_await coroutine::parallel_for_each(_base_to_build_step, [] (std::pair<const table_id, build_step>& p) {
return p.second.reader.close();
});
@@ -2682,59 +2647,63 @@ static bool should_ignore_tablet_keyspace(const replica::database& db, const sst
return db.features().view_building_coordinator && db.has_keyspace(ks_name) && db.find_keyspace(ks_name).uses_tablets();
}
future<view_builder::view_builder_units> view_builder::get_or_adopt_view_builder_lock(view_builder_units_opt units) {
co_return units ? std::move(*units) : co_await get_units(_sem, view_builder_semaphore_units);
}
future<> view_builder::dispatch_create_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
co_return;
return make_ready_future<>();
}
auto units = co_await get_or_adopt_view_builder_lock(std::nullopt);
co_await handle_seed_view_build_progress(ks_name, view_name);
co_await coroutine::all(
[this, ks_name, view_name, units = std::move(units)] mutable -> future<> {
co_await handle_create_view_local(ks_name, view_name, std::move(units)); },
[this, ks_name, view_name] mutable -> future<> {
co_await container().invoke_on_others([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable -> future<> {
return vb.handle_create_view_local(ks_name, view_name, std::nullopt); }); });
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; seed the global rows before broadcasting.
return handle_seed_view_build_progress(ks_name, view_name).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return container().invoke_on_all([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable {
return vb.handle_create_view_local(std::move(ks_name), std::move(view_name));
});
});
});
}
future<> view_builder::handle_seed_view_build_progress(const sstring& ks_name, const sstring& view_name) {
future<> view_builder::handle_seed_view_build_progress(sstring ks_name, sstring view_name) {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return _sys_ks.register_view_for_building_for_all_shards(view->ks_name(), view->cf_name(), step.current_token());
}
future<> view_builder::handle_create_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units) {
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::move(units));
future<> view_builder::handle_create_view_local(sstring ks_name, sstring view_name){
if (this_shard_id() == 0) {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
future<> view_builder::handle_create_view_local_impl(sstring ks_name, sstring view_name) {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
try {
co_await coroutine::all(
[&step] -> future<> {
co_await step.base->await_pending_writes(); },
[&step] -> future<> {
co_await step.base->await_pending_streams(); });
co_await flush_base(step.base, _as);
return when_all(step.base->await_pending_writes(), step.base->await_pending_streams()).discard_result().then([this, &step] {
return flush_base(step.base, _as);
}).then([this, view, &step] () {
// This resets the build step to the current token. It may result in views currently
// being built to receive duplicate updates, but it simplifies things as we don't have
// to keep around a list of new views to build the next time the reader crosses a token
// threshold.
co_await initialize_reader_at_current_token(step);
co_await add_new_view(view, step);
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
return initialize_reader_at_current_token(step).then([this, view, &step] () mutable {
return add_new_view(view, step);
}).then_wrapped([this, view] (future<>&& f) {
try {
f.get();
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
_build_step.signal();
// Waited on indirectly in stop().
static_cast<void>(_build_step.trigger());
});
});
}
void view_builder::on_create_view(const sstring& ks_name, const sstring& view_name) {
@@ -2743,101 +2712,90 @@ void view_builder::on_create_view(const sstring& ks_name, const sstring& view_na
}
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_create_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
static_cast<void>(dispatch_create_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view creation {}.{}: {}", ks_name, view_name, ep);
}));
}
future<> view_builder::dispatch_update_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
co_return;
}
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::nullopt);
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto step_it = _base_to_build_step.find(view->view_info()->base_id());
if (step_it == _base_to_build_step.end()) {
co_return; // In case all the views for this CF have finished building already.
}
auto status_it = std::ranges::find_if(step_it->second.build_status, [view] (const view_build_status& bs) {
return bs.view->id() == view->id();
});
if (status_it != step_it->second.build_status.end()) {
status_it->view = std::move(view);
}
}
void view_builder::on_update_view(const sstring& ks_name, const sstring& view_name, bool) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return;
}
// Do it in the background, serialized.
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_update_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
try {
std::rethrow_exception(ep);
} catch (const seastar::gate_closed_exception&) {
vlogger.warn("Ignoring gate_closed_exception during view update {}.{}", ks_name, view_name);
} catch (const seastar::broken_named_semaphore&) {
vlogger.warn("Ignoring broken_named_semaphore during view update {}.{}", ks_name, view_name);
} catch (const replica::no_such_column_family&) {
vlogger.warn("Ignoring no_such_column_family during view update {}.{}", ks_name, view_name);
(void)with_semaphore(_sem, view_builder_semaphore_units, [ks_name, view_name, this] {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto step_it = _base_to_build_step.find(view->view_info()->base_id());
if (step_it == _base_to_build_step.end()) {
return;// In case all the views for this CF have finished building already.
}
}));
auto status_it = std::ranges::find_if(step_it->second.build_status, [view] (const view_build_status& bs) {
return bs.view->id() == view->id();
});
if (status_it != step_it->second.build_status.end()) {
status_it->view = std::move(view);
}
}).handle_exception_type([] (replica::no_such_column_family&) { });
}
future<> view_builder::dispatch_drop_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
co_return;
return make_ready_future<>();
}
auto units = co_await get_or_adopt_view_builder_lock(std::nullopt);
co_await coroutine::all(
[this, ks_name, view_name, units = std::move(units)] mutable -> future<> {
co_await handle_drop_view_local(ks_name, view_name, std::move(units)); },
[this, ks_name, view_name] mutable -> future<> {
co_await container().invoke_on_others([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable -> future<> {
return vb.handle_drop_view_local(ks_name, view_name, std::nullopt); });});
co_await handle_drop_view_global_cleanup(ks_name, view_name);
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; broadcast local cleanup before global cleanup.
return container().invoke_on_all([ks_name, view_name] (view_builder& vb) mutable {
return vb.handle_drop_view_local(std::move(ks_name), std::move(view_name));
}).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_global_cleanup(std::move(ks_name), std::move(view_name));
});
});
}
future<> view_builder::handle_drop_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units) {
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::move(units));
vlogger.info0("Stopping to build view {}.{}", ks_name, view_name);
future<> view_builder::handle_drop_view_local(sstring ks_name, sstring view_name) {
if (this_shard_id() == 0) {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
co_return;
future<> view_builder::handle_drop_view_local_impl(sstring ks_name, sstring view_name) {
vlogger.info0("Stopping to build view {}.{}", ks_name, view_name);
// The view is absent from the database at this point, so find it by brute force.
([&, this] {
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
return;
}
}
}
}
})();
return make_ready_future<>();
}
future<> view_builder::handle_drop_view_global_cleanup(const sstring& ks_name, const sstring& view_name) {
future<> view_builder::handle_drop_view_global_cleanup(sstring ks_name, sstring view_name) {
if (this_shard_id() != 0) {
co_return;
return make_ready_future<>();
}
vlogger.info0("Starting view global cleanup {}.{}", ks_name, view_name);
try {
co_await coroutine::all(
[this, &ks_name, &view_name] -> future<> {
co_await _sys_ks.remove_view_build_progress_across_all_shards(ks_name, view_name); },
[this, &ks_name, &view_name] -> future<> {
co_await _sys_ks.remove_built_view(ks_name, view_name); },
[this, &ks_name, &view_name] -> future<> {
co_await remove_view_build_status(ks_name, view_name); });
} catch (...) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, std::current_exception());
}
return when_all_succeed(
_sys_ks.remove_view_build_progress_across_all_shards(ks_name, view_name),
_sys_ks.remove_built_view(ks_name, view_name),
remove_view_build_status(ks_name, view_name))
.discard_result()
.handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, ep);
});
}
void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name) {
@@ -2846,22 +2804,19 @@ void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name
}
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_drop_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
static_cast<void>(dispatch_drop_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view drop {}.{}: {}", ks_name, view_name, ep);
}));
}
future<> view_builder::run_in_background() {
return seastar::async([this] {
future<> view_builder::do_build_step() {
// Run the view building in the streaming scheduling group
// so that it doesn't impact other tasks with higher priority.
seastar::thread_attributes attr;
attr.sched_group = _db.get_streaming_scheduling_group();
return seastar::async(std::move(attr), [this] {
exponential_backoff_retry r(1s, 1min);
while (!_as.abort_requested()) {
try {
_build_step.wait([this] { return !_base_to_build_step.empty(); }).get();
} catch (const seastar::broken_condition_variable&) {
return;
}
while (!_base_to_build_step.empty() && !_as.abort_requested()) {
auto units = get_units(_sem, view_builder_semaphore_units).get();
++_stats.steps_performed;
try {

View File

@@ -303,7 +303,7 @@ endpoints_to_update get_view_natural_endpoint(
locator::host_id node,
const locator::effective_replication_map_ptr& base_erm,
const locator::effective_replication_map_ptr& view_erm,
const locator::abstract_replication_strategy& replication_strategy,
const bool network_topology,
const dht::token& base_token,
const dht::token& view_token,
bool use_tablets,

View File

@@ -11,14 +11,13 @@
#include "query/query-request.hh"
#include "service/migration_listener.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/serialized_action.hh"
#include "utils/cross-shard-barrier.hh"
#include "replica/database.hh"
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/shared_ptr.hh>
@@ -105,12 +104,6 @@ class view_update_generator;
* redo the missing step, for simplicity.
*/
class view_builder final : public service::migration_listener::only_view_notifications, public seastar::peering_sharded_service<view_builder> {
//aliasing for semaphore units that will be used throughout the class
using view_builder_units = semaphore_units<named_semaphore_exception_factory>;
//aliasing for optional semaphore units that will be used throughout the class
using view_builder_units_opt = std::optional<view_builder_units>;
/**
* Keeps track of the build progress for a particular view.
* When the view is built, next_token == first_token.
@@ -175,25 +168,14 @@ class view_builder final : public service::migration_listener::only_view_notific
reader_permit _permit;
base_to_build_step_type _base_to_build_step;
base_to_build_step_type::iterator _current_step = _base_to_build_step.end();
condition_variable _build_step;
serialized_action _build_step{std::bind(&view_builder::do_build_step, this)};
static constexpr size_t view_builder_semaphore_units = 1;
// Ensures bookkeeping operations are serialized, meaning that while we execute
// a build step we don't consider newly added or removed views. This simplifies
// the algorithms. Also synchronizes an operation wrt. a call to stop().
// Semaphore usage invariants:
// - One unit of _sem serializes all per-shard bookkeeping that mutates view-builder state
// (_base_to_build_step, _built_views, build_status, reader resets).
// - The unit is held for the whole operation, including the async chain, until the state
// is stable for the next operation on that shard.
// - Cross-shard operations acquire _sem on shard 0 for the duration of the broadcast.
// Other shards acquire their own _sem only around their local handling; shard 0 skips
// the local acquire because it already holds the unit from the dispatcher.
// Guard the whole startup routine with a semaphore so that it's not intercepted by
// `on_drop_view`, `on_create_view`, or `on_update_view` events.
seastar::named_semaphore _sem{view_builder_semaphore_units, named_semaphore_exception_factory{"view builder"}};
seastar::gate _ops_gate;
seastar::abort_source _as;
future<> _step_fiber = make_ready_future<>();
future<> _started = make_ready_future<>();
// Used to coordinate between shards the conclusion of the build process for a particular view.
std::unordered_set<table_id> _built_views;
// Used for testing.
@@ -280,19 +262,19 @@ private:
void setup_shard_build_step(view_builder_init_state& vbi, std::vector<system_keyspace_view_name>, std::vector<system_keyspace_view_build_progress>);
future<> calculate_shard_build_step(view_builder_init_state& vbi);
future<> add_new_view(view_ptr, build_step&);
future<> run_in_background();
future<> do_build_step();
void execute(build_step&, exponential_backoff_retry);
future<> maybe_mark_view_as_built(view_ptr, dht::token);
future<> mark_as_built(view_ptr);
void setup_metrics();
future<> dispatch_create_view(sstring ks_name, sstring view_name);
future<> dispatch_update_view(sstring ks_name, sstring view_name);
future<> dispatch_drop_view(sstring ks_name, sstring view_name);
future<> handle_seed_view_build_progress(const sstring& ks_name, const sstring& view_name);
future<> handle_create_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units);
future<> handle_drop_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units);
future<> handle_drop_view_global_cleanup(const sstring& ks_name, const sstring& view_name);
future<view_builder_units> get_or_adopt_view_builder_lock(view_builder_units_opt units);
future<> handle_seed_view_build_progress(sstring ks_name, sstring view_name);
future<> handle_create_view_local(sstring ks_name, sstring view_name);
future<> handle_drop_view_local(sstring ks_name, sstring view_name);
future<> handle_create_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_global_cleanup(sstring ks_name, sstring view_name);
template <typename Func1, typename Func2>
future<> write_view_build_status(Func1&& fn_group0, Func2&& fn_sys_dist) {

View File

@@ -242,7 +242,7 @@ future<> view_building_worker::create_staging_sstable_tasks() {
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
};
auto mut = co_await _sys_ks.make_view_building_task_mutation(guard.write_timestamp(), task);
auto mut = co_await _group0.client().sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
cmuts.emplace_back(std::move(mut));
}
}
@@ -386,6 +386,7 @@ future<> view_building_worker::update_built_views() {
auto schema = _db.find_schema(table_id);
return std::make_pair(schema->ks_name(), schema->cf_name());
};
auto& sys_ks = _group0.client().sys_ks();
std::set<std::pair<sstring, sstring>> built_views;
for (auto& [id, statuses]: _vb_state_machine.views_state.status_map) {
@@ -394,22 +395,22 @@ future<> view_building_worker::update_built_views() {
}
}
auto local_built = co_await _sys_ks.load_built_views() | std::views::filter([&] (auto& v) {
auto local_built = co_await sys_ks.load_built_views() | std::views::filter([&] (auto& v) {
return !_db.has_keyspace(v.first) || _db.find_keyspace(v.first).uses_tablets();
}) | std::ranges::to<std::set>();
// Remove dead entries
for (auto& view: local_built) {
if (!built_views.contains(view)) {
co_await _sys_ks.remove_built_view(view.first, view.second);
co_await sys_ks.remove_built_view(view.first, view.second);
}
}
// Add new entries
for (auto& view: built_views) {
if (!local_built.contains(view)) {
co_await _sys_ks.mark_view_as_built(view.first, view.second);
co_await _sys_ks.remove_view_build_progress_across_all_shards(view.first, view.second);
co_await sys_ks.mark_view_as_built(view.first, view.second);
co_await sys_ks.remove_view_build_progress_across_all_shards(view.first, view.second);
}
}
}
@@ -588,7 +589,11 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
utils::get_local_injector().inject("do_build_range_fail",
[] { throw std::runtime_error("do_build_range failed due to error injection"); });
return seastar::async([this, base_id, views_ids = std::move(views_ids), last_token, &as] {
// Run the view building in the streaming scheduling group
// so that it doesn't impact other tasks with higher priority.
seastar::thread_attributes attr;
attr.sched_group = _db.get_streaming_scheduling_group();
return seastar::async(std::move(attr), [this, base_id, views_ids = std::move(views_ids), last_token, &as] {
gc_clock::time_point now = gc_clock::now();
auto base_cf = _db.find_column_family(base_id).shared_from_this();
reader_permit permit = _db.get_reader_concurrency_semaphore().make_tracking_only_permit(nullptr, "build_views_range", db::no_timeout, {});

View File

@@ -67,7 +67,6 @@ public:
return schema_builder(system_keyspace::NAME, "cluster_status", std::make_optional(id))
.with_column("peer", inet_addr_type, column_kind::partition_key)
.with_column("dc", utf8_type)
.with_column("rack", utf8_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
@@ -98,22 +97,21 @@ public:
auto hostid = eps.get_host_id();
set_cell(cr, "up", gossiper.is_alive(hostid));
if (gossiper.is_shutdown(endpoint)) {
if (!ss.raft_topology_change_enabled() || gossiper.is_shutdown(endpoint)) {
set_cell(cr, "status", gossiper.get_gossip_status(endpoint));
} else {
set_cell(cr, "status", boost::to_upper_copy<std::string>(fmt::format("{}", ss.get_node_state(hostid))));
}
set_cell(cr, "load", gossiper.get_application_state_value(endpoint, gms::application_state::LOAD));
if (ss.raft_topology_change_enabled() && !gossiper.is_shutdown(endpoint)) {
set_cell(cr, "status", boost::to_upper_copy<std::string>(fmt::format("{}", ss.get_node_state(hostid))));
}
set_cell(cr, "host_id", hostid.uuid());
if (tm.get_topology().has_node(hostid)) {
// Not all entries in gossiper are present in the topology
auto& node = tm.get_topology().get_node(hostid);
sstring dc = node.dc_rack().dc;
sstring rack = node.dc_rack().rack;
set_cell(cr, "dc", dc);
set_cell(cr, "rack", rack);
set_cell(cr, "draining", node.is_draining());
set_cell(cr, "excluded", node.is_excluded());
}

View File

@@ -11,7 +11,5 @@
namespace debug {
seastar::sharded<replica::database>* volatile the_database = nullptr;
seastar::scheduling_group streaming_scheduling_group;
seastar::scheduling_group gossip_scheduling_group;
}

View File

@@ -17,8 +17,7 @@ class database;
namespace debug {
extern seastar::sharded<replica::database>* volatile the_database;
extern seastar::scheduling_group streaming_scheduling_group;
extern seastar::scheduling_group gossip_scheduling_group;
}

View File

@@ -352,6 +352,16 @@ dht::partition_range_vector to_partition_ranges(const dht::token_range_vector& r
return prs;
}
future<utils::chunked_vector<dht::partition_range>> to_partition_ranges_chunked(const dht::token_range_vector& ranges) {
utils::chunked_vector<dht::partition_range> prs;
prs.reserve(ranges.size());
for (auto& range : ranges) {
prs.push_back(dht::to_partition_range(range));
co_await coroutine::maybe_yield();
}
co_return prs;
}
std::map<unsigned, dht::partition_range_vector>
split_range_to_shards(dht::partition_range pr, const schema& s, const sharder& raw_sharder) {
std::map<unsigned, dht::partition_range_vector> ret;
@@ -364,11 +374,11 @@ split_range_to_shards(dht::partition_range pr, const schema& s, const sharder& r
return ret;
}
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& source_ranges, dht::partition_range_vector ranges_to_subtract) {
future<utils::chunked_vector<dht::partition_range>> subtract_ranges(const schema& schema, utils::chunked_vector<dht::partition_range> source_ranges, utils::chunked_vector<dht::partition_range> ranges_to_subtract) {
auto cmp = dht::ring_position_comparator(schema);
// optimize set of potentially overlapping ranges by deoverlapping them.
auto ranges = dht::partition_range::deoverlap(source_ranges, cmp);
dht::partition_range_vector res;
auto ranges = dht::partition_range::deoverlap(std::move(source_ranges), cmp);
utils::chunked_vector<dht::partition_range> res;
res.reserve(ranges.size() * 2);
auto range = ranges.begin();

View File

@@ -91,6 +91,7 @@ inline token get_token(const schema& s, partition_key_view key) {
dht::partition_range to_partition_range(dht::token_range);
dht::partition_range_vector to_partition_ranges(const dht::token_range_vector& ranges, utils::can_yield can_yield = utils::can_yield::no);
future<utils::chunked_vector<dht::partition_range>> to_partition_ranges_chunked(const dht::token_range_vector& ranges);
// Each shard gets a sorted, disjoint vector of ranges
std::map<unsigned, dht::partition_range_vector>
@@ -105,7 +106,7 @@ std::unique_ptr<dht::i_partitioner> make_partitioner(sstring name);
// Returns a sorted and deoverlapped list of ranges that are
// the result of subtracting all ranges from ranges_to_subtract.
// ranges_to_subtract must be sorted and deoverlapped.
future<dht::partition_range_vector> subtract_ranges(const schema& schema, const dht::partition_range_vector& ranges, dht::partition_range_vector ranges_to_subtract);
future<utils::chunked_vector<dht::partition_range>> subtract_ranges(const schema& schema, utils::chunked_vector<dht::partition_range> ranges, utils::chunked_vector<dht::partition_range> ranges_to_subtract);
// Returns a token_range vector split based on the given number of most-significant bits
dht::token_range_vector split_token_range_msb(unsigned most_significant_bits);

View File

@@ -30,6 +30,31 @@ enum class token_kind {
after_all_keys,
};
// Represents a token for partition keys.
// Has a disengaged state, which sorts before all engaged states.
struct raw_token {
int64_t value;
/// Constructs a disengaged token.
raw_token() : value(std::numeric_limits<int64_t>::min()) {}
/// Constructs an engaged token.
/// The token must be of token_kind::key kind.
explicit raw_token(const token&);
explicit raw_token(int64_t v) : value(v) {};
std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
std::strong_ordering operator<=>(const token& o) const noexcept;
/// Returns true iff engaged.
explicit operator bool() const noexcept {
return value != std::numeric_limits<int64_t>::min();
}
};
using raw_token_opt = seastar::optimized_optional<raw_token>;
class token {
// INT64_MIN is not a legal token, but a special value used to represent
// infinity in token intervals.
@@ -52,6 +77,10 @@ public:
constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}
token(raw_token raw) noexcept
: token(raw ? kind::key : kind::before_all_keys, raw.value)
{ }
// This constructor seems redundant with the bytes_view constructor, but
// it's necessary for IDL, which passes a deserialized_bytes_proxy here.
// (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
}
};
inline
raw_token::raw_token(const token& t)
: value(t.raw())
{
#ifdef DEBUG
assert(t._kind == token::kind::key);
#endif
}
inline
std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
switch (o._kind) {
case token::kind::after_all_keys:
return std::strong_ordering::less;
case token::kind::before_all_keys:
// before_all_keys has a raw value set to the same raw value as a disengaged raw_token, and sorts before all keys.
// So we can order them by just comparing raw values.
[[fallthrough]];
case token::kind::key:
return value <=> o._data;
}
}
inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
if (l1 == l2) {
return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
}
};
template <>
struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const dht::raw_token& t, FormatContext& ctx) const {
if (!t) {
return fmt::format_to(ctx.out(), "null");
}
return fmt::format_to(ctx.out(), "{}", t.value);
}
};
namespace std {
template<>

View File

@@ -12,7 +12,7 @@ Do the following in the top-level Scylla source directory:
2. Run `ninja dist-dev` (with the same mode name as above) to prepare
the distribution artifacts.
3. Run `./dist/docker/redhat/build_docker.sh --mode dev`
3. Run `./dist/docker/debian/build_docker.sh --mode dev`
This creates a docker image as a **file**, in the OCI format, and prints
its name, looking something like:

View File

@@ -70,7 +70,7 @@ bcp() { buildah copy "$container" "$@"; }
run() { buildah run "$container" "$@"; }
bconfig() { buildah config "$@" "$container"; }
container="$(buildah from --pull=always docker.io/redhat/ubi9-minimal:latest)"
container="$(buildah from docker.io/redhat/ubi9-minimal:latest)"
packages=(
"build/dist/$config/redhat/RPMS/$arch/$product-$version-$release.$arch.rpm"

View File

@@ -1,10 +1,6 @@
### a dictionary of redirections
#old path: new path
# Move the OS Support page
/stable/getting-started/os-support.html: https://docs.scylladb.com/stable/versioning/os-support-per-version.html
# Remove an outdated KB
/stable/kb/perftune-modes-sync.html: /stable/kb/index.html

View File

@@ -402,82 +402,3 @@ it also describes authentication/authorization and service levels. Additionally,
statement: `DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS`, which also includes the information about hashed passwords of the roles.
For more details, see [the article on DESCRIBE SCHEMA](./describe-schema.rst).
## Per-row TTL
CQL's traditional time-to-live (TTL) feature attaches an expiration time to
each cell - i.e., each value in each column. For example, the statement:
```
UPDATE tbl USING TTL 60 SET x = 1 WHERE p = 2
```
Sets a new value for the column `x` in row `p = 2`, and asks for this value to
expire in 60 seconds. When a row is updated incrementally, with different
columns set at different times, this can result in different pieces of the
row expiring at different times. Applications rarely want partially-expired
rows, so they often need to re-write an entire row each time the row needs
updating. In particular, it is not possible to change the expiration time of
an existing row without re-writing it.
Per-row time-to-live (TTL) is a new CQL feature that is an alternative to
the traditional per-cell TTL. One column is designated as the "expiration
time" column, and the value of this column determines when the entire row
will expire. It becomes possible to update pieces of a row without changing
its expiration time, and vice versa - to change a row's expiration time
without rewriting its data.
The expiration-time column of a table can be chosen when it is created by
adding the keyword "TTL" to one of the columns:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
The TTL column's name, in this example `expiration`, can be anything.
Per-row TTL can also be enabled on an existing table by adding the "TTL"
designation to one of the existing columns, with:
```cql
ALTER TABLE tab TTL colname
```
Or per-row TTL can be disabled (rows will never expire), with:
```cql
ALTER TABLE tab TTL NULL
```
It is not possible to enable per-row TTL if it's already enabled, or disable
it when already disabled. If you have TTL enabled on one column and want to
enable it instead on a second column, you must do it in two steps: First
disable TTL and then re-enable it on the second column.
The designated TTL column must have the type `timestamp` or `bigint`,
and specifies the absolute time when the row should expire (the `bigint`
type is interpreted as seconds since the UNIX epoch). It must be a regular
column (not a primary key column or a static column), and there can only be
one such column.
The 32-bit type `int` (specifying number of seconds since the UNIX epoch)
is also supported, but not recommended because it will wrap around in 2038.
Unless you must use the `int` type because of pre-existing expiration data
with that type, please prefer `timestamp` or `bigint`.
Another important feature of per-row TTL is that if CDC is enabled, when a
row expires a deletion event appears in the CDC log - something that doesn't
happen in per-cell TTL. This deletion event can be distinguished from user-
initiated deletes: Whereas user-initiated deletes have `cdc_operation` set to
3 (`row_delete`) or 4 (`partition_delete`), those generated by expiration have
`cdc_operation` -3 (`service_row_delete`) or -4 (`service_partition_delete`).
Unlike per-cell TTL where a value becomes unreadable at the precise specified
second, the per-row TTL's expiration is _eventual_ - the row will expire
some time _after_ its requested expiration time, where this "some time" can
be controlled by the configuration `alternator_ttl_period_in_seconds`. Until
the row is actually deleted, it can still be read, and even written.
Importantly, the CDC event will appear immediately after the row is finally
deleted.
It's important to re-iterate that the per-cell TTL and per-row TTL features
are separate and distinct, use a different CQL syntax, have a different
implementation and provide different guarantees. It is possible to use
both features in the same table, or even the same row.

View File

@@ -476,7 +476,7 @@ Creating a new table uses the ``CREATE TABLE`` statement:
: [ ',' PRIMARY KEY '(' `primary_key` ')' ]
: ')' [ WITH `table_options` ]
column_definition: `column_name` `cql_type` [ TTL ] [ STATIC ] [ PRIMARY KEY]
column_definition: `column_name` `cql_type` [ STATIC ] [ PRIMARY KEY]
primary_key: `partition_key` [ ',' `clustering_columns` ]
@@ -572,11 +572,6 @@ A :token:`column_definition` is primarily comprised of the name of the column de
which restricts which values are accepted for that column. Additionally, a column definition can have the following
modifiers:
``TTL``
declares the column as being the expiration-time column for the
`per-row TTL <https://docs.scylladb.com/stable/cql/cql-extensions.html#per-row-ttl>`_
feature.
``STATIC``
declares the column as being a :ref:`static column <static-columns>`.
@@ -1044,29 +1039,7 @@ You can enable the after-repair tombstone GC by setting the ``repair`` mode usin
ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'} ;
To support writes arriving out-of-order -- either due to natural delays, or user provided timestamps -- the repair mode has a propagation delay.
Out-of-order writes present a problem for repair mode tombstone gc. Consider the following example sequence of events:
1) Write ``DELETE FROM table WHERE key = K1`` arrives at the node.
2) Repair is run.
3) Compaction runs and garbage collects the tombstone for ``key = K1``.
4) Write ``INSERT INTO table (key, ...) VALUES (K1, ...)`` arrives at the node with timestamp smaller than that of the delete. The tombstone for ``key = K1`` should apply to this write, but it is already garbage collected, so this data is resurrected.
Propagation delay solves this problem by establishing a window before repair, where tombstones are not yet garbage collectible: a tombstone is garbage collectible if it was written before the last repair by at least the propagation delay.
The value of the propagation delay can be set via the ``propagation_delay_in_seconds`` parameter:
.. code-block:: cql
CREATE TABLE ks.cf (key blob PRIMARY KEY, val blob) WITH tombstone_gc = {'mode':'repair', 'propagation_delay_in_seconds': 120};
.. code-block:: cql
ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair', 'propagation_delay_in_seconds': 120};
The default value of the propagation delay is 1 hour. This parameter should only be changed if your application uses user provided timestamps and writes and deletes can arrive out-of-order by more than the default 1 hour.
The following tombstone gc modes are available:
The following modes are available:
.. list-table::
:widths: 20 80
@@ -1177,7 +1150,6 @@ Altering an existing table uses the ``ALTER TABLE`` statement:
: | DROP '(' `column_name` ( ',' `column_name` )* ')' [ USING TIMESTAMP `timestamp` ]
: | ALTER `column_name` TYPE `cql_type`
: | WITH `options`
: | TTL (`column_name` | NULL)
: | scylla_encryption_options: '=' '{'[`cipher_algorithm` : <hash>]','[`secret_key_strength` : <len>]','[`key_provider`: <provider>]'}'
For instance:
@@ -1221,11 +1193,6 @@ The ``ALTER TABLE`` statement can:
- Change or add any of the ``Encryption options`` above.
- Change or add any of the :ref:`CDC options <cdc-options>` above.
- Change or add per-partition rate limits. See :ref:`Limiting the rate of requests per partition <ddl-per-parition-rate-limit>`.
- Enable `per-row TTL <https://docs.scylladb.com/stable/cql/cql-extensions.html#per-row-ttl>`_
using the given column as the expiration-time column, or disable per-row
TTL on this table. If per-row TTL is already enabled, to change the choice
of expiration-time column you must first disable per-row TTL and then
re-enable it using the chosen column.
.. warning:: Dropping a column assumes that the timestamps used for the value of this column are "real" timestamp in
microseconds. Using "real" timestamps in microseconds is the default is and is **strongly** recommended, but as

View File

@@ -25,8 +25,6 @@ Querying data from data is done using a ``SELECT`` statement:
: | CAST '(' `selector` AS `cql_type` ')'
: | `function_name` '(' [ `selector` ( ',' `selector` )* ] ')'
: | COUNT '(' '*' ')'
: | literal
: | bind_marker
: )
: ( '.' `field_name` | '[' `term` ']' )*
where_clause: `relation` ( AND `relation` )*
@@ -37,8 +35,6 @@ Querying data from data is done using a ``SELECT`` statement:
operator: '=' | '<' | '>' | '<=' | '>=' | IN | NOT IN | CONTAINS | CONTAINS KEY
ordering_clause: `column_name` [ ASC | DESC ] ( ',' `column_name` [ ASC | DESC ] )*
timeout: `duration`
literal: number | 'string' | boolean | NULL | tuple_literal | list_literal | map_literal
bind_marker: '?' | ':' `identifier`
For instance::
@@ -85,13 +81,6 @@ A :token:`selector` can be one of the following:
- A casting, which allows you to convert a nested selector to a (compatible) type.
- A function call, where the arguments are selector themselves.
- A call to the :ref:`COUNT function <count-function>`, which counts all non-null results.
- A literal value (constant).
- A bind variable (`?` or `:name`).
Note that due to a quirk of the type system, literals and bind markers cannot be
used as top-level selectors, as the parser cannot infer their type. However, they can be used
when nested inside functions, as the function formal parameter types provide the
necessary context.
Aliases
```````
@@ -292,8 +281,8 @@ For example::
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key
or columns provided in a definition of the index.
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key.
See :ref:`WHERE <where-clause>`.
For example::
@@ -301,10 +290,6 @@ For example::
WHERE user_id = 'user123'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
The supported operations are equal relations (``=`` and ``IN``) with restrictions as in regular ``WHERE`` clauses. See :ref:`WHERE <where-clause>`.
Other filtering scenarios are currently not supported.
.. note::
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.

View File

@@ -277,9 +277,6 @@ Dropping a secondary index uses the ``DROP INDEX`` statement:
The ``DROP INDEX`` statement is used to drop an existing secondary index. The argument of the statement is the index
name, which may optionally specify the keyspace of the index.
If the index is currently being built, the ``DROP INDEX`` can still be executed. Once the ``DROP INDEX`` command is issued,
the system stops the build process and cleans up any partially built data associated with the index.
.. If the index does not exists, the statement will return an error, unless ``IF EXISTS`` is used in which case the
.. operation is a no-op.

View File

@@ -13,19 +13,6 @@ The TTL can be set when defining a Table (CREATE), or when using the INSERT and
The expiration works at the individual column level, which provides a lot of flexibility.
By default, the TTL value is null, which means that the data will not expire.
This document is about CQL's classic per-write TTL feature, where individual
columns from the same row can expire at separate times if written at
different times. ScyllaDB also supports an alternative TTL feature,
`Per-row TTL <https://docs.scylladb.com/stable/cql/cql-extensions.html#per-row-ttl>`_.
In *per-row TTL* each row has an expiration time for the entire row,
defined by the value of the expiration-time column. In per-row TTL, the
entire row expires together regardless of how its indivial columns were
written, and the expiration time of an entire row can be modified by modifying
the expiration-time column. Another benefit of per-row TTL is that it
generates a CDC event when a row expires - in contrast in per-write TTL
(the feature described in this document) where expiration events do not
show up in CDC.
.. note::
The expiration time is always calculated as *now() on the Coordinator + TTL* where, *now()* is the wall clock during the corresponding write operation.

View File

@@ -108,4 +108,6 @@ check the statement and throw if it is disallowed, similar to what
Obviously, an audit definition must survive a server restart and stay
consistent among all nodes in a cluster. We'll accomplish both by
storing audits in a system table.
storing audits in a system table. They will be cached in memory the
same way `permissions_cache` caches table contents in `permission_set`
objects resident in memory.

View File

@@ -39,17 +39,6 @@ Both client and server use the same string identifiers for the keys to determine
negotiated extension set, judging by the presence of a particular key in the
SUPPORTED/STARTUP messages.
## Client options
`client_options` column in `system.clients` table stores all data sent by the
client in STARTUP request, as a `map<text, text>`. This column may be useful
for debugging and monitoring purposes.
Drivers can send additional data in STARTUP, e.g. load balancing policy, retry
policy, timeouts, and other configuration.
Such data should be sent in `CLIENT_OPTIONS` key, as JSON. The recommended
structure of this JSON will be decided in the future.
## Intranode sharding
This extension allows the driver to discover how Scylla internally
@@ -85,6 +74,8 @@ The keys and values are:
as an indicator to which shard client wants to connect. The desired shard number
is calculated as: `desired_shard_no = client_port % SCYLLA_NR_SHARDS`.
Its value is a decimal representation of type `uint16_t`, by default `19142`.
- `CLIENT_OPTIONS` is a string containing a JSON object representation that
contains CQL Driver configuration, e.g. load balancing policy, retry policy, timeouts, etc.
Currently, one `SCYLLA_SHARDING_ALGORITHM` is defined,
`biased-token-round-robin`. To apply the algorithm,

View File

@@ -78,7 +78,6 @@ Permits are in one of the following states:
* `active/await` - a previously `active/need_cpu` permit, which needs something other than CPU to proceed, it is waiting on I/O or a remote shards, other permits can be admitted while the permit is in this state, pending resource availability;
* `inactive` - the permit was marked inactive, it can be evicted to make room for admitting more permits if needed;
* `evicted` - a former inactive permit which was evicted, the permit has to undergo admission again for the read to resume;
* `preemptive_aborted` - the permit timed out or was rejected during admission as it was detected the read might time out later during execution;
Note that some older releases will have different names for some of these states or lack some of the states altogether:

View File

@@ -563,18 +563,17 @@ CREATE TABLE system.clients (
address inet,
port int,
client_type text,
client_options frozen<map<text, text>>,
connection_stage text,
driver_name text,
driver_version text,
hostname text,
protocol_version int,
scheduling_group text,
shard_id int,
ssl_cipher_suite text,
ssl_enabled boolean,
ssl_protocol text,
username text,
scheduling_group text,
PRIMARY KEY (address, port, client_type)
) WITH CLUSTERING ORDER BY (port ASC, client_type ASC)
~~~
@@ -582,7 +581,4 @@ CREATE TABLE system.clients (
Currently only CQL clients are tracked. The table used to be present on disk (in data
directory) before and including version 4.5.
`client_options` column stores all data sent by the client in the STARTUP request.
This column is useful for debugging and monitoring purposes.
## TODO: the rest

Some files were not shown because too many files have changed in this diff Show More