Commit Graph

4754 Commits

Author SHA1 Message Date
Avi Kivity
196db8931e partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.

This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.

However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.

The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.

Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.

The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.

The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
  - Mutation cleaner merging versions in the background (changes change_mark)
  - LSA segment compaction during reserve() (invalidates references)
  - B-tree rebalancing on partition insertion (invalidates references)
  - Debug mode's always-true need_preempt() creating many multi-version
    partitions via preempted apply_monotonically()

A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.

Fixes: SCYLLADB-1253

Closes scylladb/scylladb#29368

(cherry picked from commit 21d9f54a9a)

Closes scylladb/scylladb#29480
2026-04-15 18:54:57 +03:00
Piotr Dulikowski
bf1f5ee796 Merge 'db/view/view_building_worker: lock staging sstables mutex for all necessary shards when creating tasks' from Michał Jadwiszczak
To create `process_staging` view building tasks, we firstly need to collect informations about them on shard0, create necessary mutations, commit them to group0 and move staging sstables objects to their original shards.

But there is a possible race after committing the group0 command and before moving the staging sstables to their shards. Between those two events, the coordinator may schedule freshly created tasks and dispatch them to the worker but the worker won't have the sstables objects because they weren't moved yet.

This patch fixes the race by holding `_staging_sstables_mutex` locks from all necessary shards when executing `create_staging_sstable_tasks()`. With this, even if the task will be scheduled and dispatched quickly, the worker will wait with executing it until the sstables objects are moved and the locks are released.

Fixes SCYLLADB-816

This PR should be backported to all versions containing view building coordinator (2025.4 and newer).

Closes scylladb/scylladb#29174

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indentation
  db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks

(cherry picked from commit ec0231c36c)

Closes scylladb/scylladb#29394
2026-04-09 15:11:17 +02:00
Botond Dénes
7afcc56128 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855

(cherry picked from commit 13ff9c4394)

Closes scylladb/scylladb#28975
2026-03-20 10:29:23 +02:00
Piotr Dulikowski
c7ac3b5394 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928

(cherry picked from commit 42d70baad3)

Closes scylladb/scylladb#28968
2026-03-17 13:35:22 +01:00
Piotr Dulikowski
d6ed05efc1 Merge '[Backport 2026.1] mv: allow skipping view updates when a collection is unmodified' from Scylladb[bot]
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: SCYLLADB-996

- (cherry picked from commit 01ddc17ab9)

Parent PR: #28839

Closes scylladb/scylladb#28977

* github.com:scylladb/scylladb:
  Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
  mv: remove dead code in view_updates::can_skip_view_updates
2026-03-17 13:34:28 +01:00
Avi Kivity
1105d83893 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset

(cherry picked from commit 01ddc17ab9)
2026-03-10 21:27:23 +01:00
Wojciech Mitros
9b9d5cee8a mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.

It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

(cherry picked from commit ca1c8ff209)
2026-03-10 21:24:53 +01:00
Tomasz Grabiec
a8fd9936a3 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param

(cherry picked from commit b90fe19a42)

Closes scylladb/scylladb#28916
2026-03-10 11:58:03 +01:00
Botond Dénes
9190d42863 Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop

(cherry picked from commit 509f2af8db)

Closes scylladb/scylladb#28934
2026-03-09 10:25:47 +02:00
Avi Kivity
8a7f5f1428 Merge '[Backport 2026.1] raft ropology: prevent crashes of multiple nodes' from Scylladb[bot]
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

- (cherry picked from commit e21ecf69de)

- (cherry picked from commit 8e9c7397c5)

Parent PR: #28558

Manually cherry-picked 2a3476094e.

Closes scylladb/scylladb#28735

* github.com:scylladb/scylladb:
  storage_service: raft_topology_cmd_handler: fix use-after-free
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-03-05 20:49:03 +02:00
Marcin Maliszkiewicz
81685b0d06 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature

(cherry picked from commit a83ee6cf66)

Closes scylladb/scylladb#28853
2026-03-04 08:28:39 +02:00
Botond Dénes
0dfefc3f12 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080

(cherry picked from commit b637e17b19)

Closes scylladb/scylladb#28725
2026-02-27 06:32:15 +02:00
Patryk Jędrzejczak
dc3133b031 raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

(cherry picked from commit e21ecf69de)
2026-02-20 13:00:19 +01:00
Calle Wilund
d87467f77b commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679

(cherry picked from commit ab4e4a8ac7)

Closes scylladb/scylladb#28693
2026-02-18 12:32:43 +02:00
Michał Hudobski
f633f57163 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519

(cherry picked from commit 6b9fcc6ca3)

Closes scylladb/scylladb#28538
2026-02-05 10:31:39 +01:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Piotr Smaron
d4c28690e1 db: fail reads and writes with local consistencty level to a DC with RF=0
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893
2026-01-22 12:49:45 +01:00
Piotr Smaron
9475659ae8 db: consistency_level: split local_quorum_for()
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
2026-01-22 12:49:23 +01:00
Piotr Smaron
0b3ee197b6 db: consistency_level: fix nrs -> nts abbreviation
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
2026-01-22 12:48:37 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Avi Kivity
c6dfae5661 treewide: #include Seastar headers with angle brackets
Seastar is an external library from the point of view of
ScyllaDB, so should be included with angle brackets.

Closes scylladb/scylladb#27947
2026-01-13 14:56:15 +02:00
Pavel Emelyanov
f227de24b2 object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Avi Kivity
66aee0fb5e alternator: add optional listeners for proxy protocol v2
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.

We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.

Closes scylladb/scylladb#27889
2026-01-13 09:59:24 +02:00
Botond Dénes
04b8f72946 Merge 'repair: Implement auto repair for tablet repair' from Asias He
repair: Implement auto repair for tablet repair

This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99

New feature. No backport.

Closes scylladb/scylladb#27534

* github.com:scylladb/scylladb:
  topology_coordinator: Add metrics for tablet repair
  repair: Implement auto repair for tablet repair
2026-01-12 14:16:01 +02:00
Petr Gusev
889d7782ed treewide: use coroutine::maybe_yield in coroutines
It's more efficient since coroutine::maybe_yield returns
a lightweight struct (awaitable), not the future.

Closes scylladb/scylladb#28101
2026-01-12 10:38:47 +01:00