Compare commits

..

2456 Commits

Author SHA1 Message Date
Szymon Malewski
73f0deef6d alternator: optional stripping of http response headers
In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.

This patch introduces three Alternator config options:
  - alternator_http_response_server_header,
  - alternator_http_response_disable_date_header,
  - alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.

The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.

Tests are in test/alternator/test_http_headers.py.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70

Closes scylladb/scylladb#28288
2026-04-19 09:22:04 +03:00
Nadav Har'El
f83270df12 Merge 'alternator/streams: Block tablet merges for Alternator Streams on tablet tables' from Piotr Szymaniak
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.

For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.

For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.

The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.

`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.

Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).

Refs SCYLLADB-461

Closes scylladb/scylladb#29224

* github.com:scylladb/scylladb:
  topology: Wake coordinator promptly for stream enablement lifecycle
  test/cluster: Test deferred stream enablement on tablet tables
  alternator/streams: Block tablet merges when Alternator Streams are enabled
2026-04-19 09:15:13 +03:00
Piotr Szymaniak
a2a0868c7d topology: Wake coordinator promptly for stream enablement lifecycle
The topology coordinator sleeps on a condition variable between
iterations. Several events relevant to Alternator stream enablement
did not wake it, causing delays of up to 60s (the periodic load
stats refresh interval) at each step:

1. Error injection release: when a test disables the
   delay_cdc_stream_finalization injection, the coordinator was
   not notified. Add an on_disable callback mechanism to the error
   injection framework (register_on_disable / unregister_on_disable)
   so subsystems can react when an injection is released. The
   topology coordinator uses this to broadcast its event.

2. Tablet split completion: after all local storage groups for a
   table finish splitting, split_ready_seq_number is set but the
   coordinator only discovered this via the periodic stats refresh.
   Add an on_tablet_split_ready callback to topology_state_machine
   that the coordinator sets to trigger_load_stats_refresh(). The
   split monitor in storage_service calls it when all compaction
   groups are split-ready, giving the coordinator fresh stats
   immediately so it can finalize the resize.

These changes reduce test_deferred_stream_enablement_on_tablets
from ~120s to ~13s and fix a production issue where Alternator
stream enablement could be delayed by up to 60s at each step of
the lifecycle (error injection release, split completion).
2026-04-19 03:54:33 +02:00
Piotr Szymaniak
a5d35d2b4c test/cluster: Test deferred stream enablement on tablet tables
Async cluster test exercising the deferred enablement lifecycle:
ENABLING -> ENABLED -> disabled, verifying tablet merge blocking
and unblocking at each stage. Uses delay_cdc_stream_finalization
error injection and CQL ALTER TABLE with tablet count constraints.

Also adds tablet scheduler config to test_config.yaml (fast refresh
interval, scale factor 1) for reliable tablet count changes.
2026-04-19 03:54:33 +02:00
Piotr Szymaniak
4b6937b570 alternator/streams: Block tablet merges when Alternator Streams are enabled
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).

add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.

For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.

Disabling streams clears all flags, immediately re-allowing merges.

The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.

Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
2026-04-19 03:54:33 +02:00
Avi Kivity
f5886b4fdd Merge 'Add virtual task for vnodes-to-tablets migrations' from Nikos Dragazis
This PR exposes vnodes-to-tablets migrations through the task manager API via a virtual task. This allows users to list, query status, and wait on ongoing migrations through a standard interface, consistent with other global operations such as tablet operations and topology requests are already exposed.

The virtual task exposes all migrations that are currently in progress. Each migrating keyspace appears as a separate task, identified by a deterministic name-based (v3) UUID derived from the keyspace name. Progress is reported as the number of nodes that have switched to tablets vs. the total. The number increases on the forward path and decreases on rollback.

The task is not abortable - rolling back a migration requires a manual procedure.

The `wait` API blocks until the migration either completes (returning `done`) or is rolled back (returning `suspended`).

Example output:
```
$ scylla nodetool tasks list vnodes_to_tablets_migration
task_id                              type                        kind    scope    state   sequence_number keyspace table entity shard start_time end_time
1747b573-6cd6-312d-abb1-9b66c1c2d81f vnodes_to_tablets_migration cluster keyspace running 0               ks                    0

$ scylla nodetool tasks status 1747b573-6cd6-312d-abb1-9b66c1c2d81f
id: 1747b573-6cd6-312d-abb1-9b66c1c2d81f
type: vnodes_to_tablets_migration
kind: cluster
scope: keyspace
state: running
is_abortable: false
start_time:
end_time:
error:
parent_id: none
sequence_number: 0
shard: 0
keyspace: ks
table:
entity:
progress_units: nodes
progress_total: 3
progress_completed: 0
```

Fixes SCYLLADB-1150.

New feature, no backport needed.

Closes scylladb/scylladb#29256

* github.com:scylladb/scylladb:
  test: cluster: Verify vnodes-to-tablets migration virtual task
  distributed_loader: Link resharding tasks to migration virtual task
  distributed_loader: Make table_populator aware of migration rollbacks
  service: Add virtual task for vnodes-to-tablets migrations
  storage_service: Guard migration status against uninitialized group0
  compaction: Add parent_id to table_resharding_compaction_task_impl
  storage_service: Add keyspace-level migration status function
  storage_service: Replace migration status string with enum
  utils: Add UUID::is_name_based()
2026-04-19 00:56:33 +03:00
Nadav Har'El
2943d30b0c Update seastar submodule
* seastar 4d268e0e...22a5aa13 (36):
  > apps/httpd: replace deprecated reply::done() with write_body()
  > missing header(s)
  > net: Fix missing throw for runtime_error in create_native_net_device
  > tests/io_queue: account for token bucket refill granularity in bandwidth checks
  > Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
    tests: add regression tests for zero-length iovec handling
    iovec: fix iovec_trim_front infinite loop on zero-length iovecs
  > util/process: graduate process management API from experimental
  > cooking: don't register ready.txt as a build output
  > sstring: make make_sstring not static
  > Add SparkyLinux to debian list in install-dependencies.sh
  > http: allow control over default response headers
  > Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
    tests/perf: add chunked_fifo microbenchmarks
    chunked_fifo: set the default free chunk retention to 0
    chunked_fifo: make free chunk retention configurable
  > Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
    tests: add regression test for pollable_fd_state_completion reuse
    reactor_backend: use reset() in AIO and epoll poll paths
    reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
  > Merge 'coroutine: Generator cleanups' from Kefu Chai
    coroutine/generator: extract schedule_or_resume helper
    coroutine/generator: remove unused next_awaiter classes
    coroutine/generator: remove write-only _started field
    coroutine/generator: assert on unreachable path in buffered await_resume
    coroutine/generator: add elements_of tag and #include <ranges>
    coroutine/generator: add empty() to bounded_container concept
  > cmake: bump minimum Boost version to 1.79.0
  > seastar_test: remove unnecessary headers
  > cmake: bump minimum GnuTLS version to 3.7.4
  > Merge 'reactor: add get_all_io_queues() method' from Travis Downs
    tests: add unit test for reactor::get_all_io_queues()
    reactor: add get_all_io_queues() method
    reactor: move get_io_queue and try_get_io_queue to .cc file
  > http: deprecate reply::done(), remove _response_line dead field
  > core: Deprecate scattered_message
  > ci: add workflow dispatch to tests workflow
  > perf_tests: exit non-zero when -t pattern matches no tests
  > Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
  > perf_tests: add total runtime to json output
  > Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
    implement move assignment operator for json_list_template
    json_list_template copy assignment operator reserves capacity upfront
  > perf_tests: add --no-perf-counters option
  > Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
    memory: Add compile-time test for value-to-human-readable conversion
    memory: Extend list of suffixes to have peta-s
    memory: Fix off-by-one in suffix calculation
    memory: Mark to_human_readable_value() and others constexpr
  > http: Improve writing of response_line() into the output
  > Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
    websocket: add template parameter for text/binary frame mode
    websocket: impl client side websocket function
  > file: Fix checks for file being read-only
  > reactor: Make do_dump_task_queue a task_queue method
  > Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
    tests/output_stream: sample type patterns in sanitizer builds
    tests/output_stream: extend invariant test to cover mixed write modes
    iostream: allow unrestricted mixing of buffered and zero-copy writes
    tests/output_stream: remove obsolete ad-hoc splitting tests
    tests/output_stream: add invariant-based splitting tests
    iostream: rename output_stream::_size to ::_buffer_size
  > reactor_backend: replace virtual bool methods with const bool_class members
  > resource: Avoid copying CPU vector to break it into groups
  > perf_tests: increase overhead column precision to 3 decimal places
  > Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
    reactor: Deprecate fdatasync() method
    file: Do fdatasync() right in the posix_file_impl::flush()
    file: Propagate aio_fdatasync to posix_file_impl
    reactor: Move reactor::fdatasync() code to file.cc
    reactor,file: Make full use of file_open_options::durable bit
    file: Add file_open_options::durable boolean
    file: Account io_stats::fsyncs in posix_file_impl::flush()
    reactor: Move _fsyncs counter onto io_stats
  > http: Remove connection::write_body()
2026-04-18 11:52:33 +03:00
Nadav Har'El
31e0315710 Merge 'alternator: fix unnecesary cdc log entries' from Radosław Cybulski
Fix cdc writing unnecesary entries to it's log, like for example when Alternator deletes an item which in reality doesn't exist.

Originally @wps0 tackled this issue. This patch is an extension of his work. His work involved adding `should_skip` function to cdc, which would process a `mutation` object and decide, wherever changes in the object should be added to cdc log or not.

The issue with his approach is that `mutation` object might contain changes for more than one row. If - for example - the `mutation` object contains two changes, delete of non-existing row and create of non-existing row, `should_skip` function will detect changes in second item and allow whole `mutation` (BOTH items) to be added. For example (using python's boto3) running this on empty table:
```
with table.batch_writer() as batch:
    batch.put_item({'p': 'p', 'c': 'c0'})
    batch.delete_item(Key={'p': 'p', 'c': 'c1'})
```
will emit two events ("put" event and "delete" event), even though the item with `c` set to `c1` does not exist (thus can't be deleted). Note, that both entries in batch write must use the same partition key, otherwise upper layer with split them into separate `mutation` objects and the issue will not happen.

The solution is to do similar processing, but consider each change separated from others. This is tricky to implement due to a way cdc works. When cdc processes `mutation` object (containing X changes), it emits cdc entries in phases. Phase 1 - emit `preimage` (old state) for each change (if requested). Phase 2 - for each change emit actual "diff" (update / delete and so on). Phase 3 - emit `postimage` (new state).

We will know if change needs to be skipped during phase 2. By that time phase 1 is completed and preimage for the change is emited. At that moment we set a flag that the change (identified by clustering key value) needs to be skipped - we add a clustering key to a `ignore-rows` set (`_alternator_clustering_keys_to_ignore` variable) and continue normally. Once all phases finish we add a `postprocess` phase (`clean_up_noop_rows` function). It will go through generated cdc mutations and skip all modifications, for which clustering key is in `ignore-rows` set. After skipping we need to do a "cleanup" operation - each generated cdc mutation contain index (incremented by one), if we skipped some parts, the index is not consecutive anymore, so we reindex final changes.

There's a special case worth mentioning - Alternator tables without clustering keys. At that point `mutation` object passed to cdc can contain exactly one change (since different partition keys are splitted by upper layers and Alternator will never emit `mutation` object containing two (or more) changes with the same primary key. Here, when we decide the change is to be skipped we add empty `bytes` object to `ignore-rows` set. When checking `ignore-rows` set, we check if it's empty or not (we don't check for presence of empty `bytes` object).

Note: there might be some confusion between this patch and #28452 patch. Both started from the same error observation and use similar tests for validation, as both are easily triggered by BatchWrite commands (both needs `mutation` object passed to cdc to contain more than one single change). This issue tho is about wrong data written in cdc log and is fixed at cdc, where #28452 is about wrong way of parsing correct cdc data and is fixed at Alternator side of things. Note, that we need #28452 to truly verify (otherwise we will emit correct cdc entries, but Alternator will incorrectly parse them).

Note: to benefit / notice this patch you need `alternator_streams_increased_compatibility` flag turned on.

Note: rework is quite "broad" and covers a lot of ground - every operation, that might result in a no-change to the database state should be tested. An additional test was added - trying to remove a column from non-existing item, as well as trying to remove non-existing column from existing item.

Fixes: #28368
Fixes: SCYLLADB-1528
Fixes: SCYLLADB-538

Closes scylladb/scylladb#28544

* github.com:scylladb/scylladb:
  alternator: remove unnecesary code
  alternator: fix Alternator writing unnecesary cdc entries
  alternator: add failing tests for Streams
2026-04-18 00:07:51 +03:00
Nadav Har'El
32060d73df Merge 'alternator: Add stream support for tablets' from Radosław Cybulski
Implements neccesary changes for Streams to work with tablet based tables.

- add utility functions to `system_keyspace` that helps reading cdc content from cdc log tables for tablet based base tables (similar api to ones for vnodes)
- remove antitablet `if` checks, update tests that fail / skip if tablets are selected
- add two tests to extensively test tablet based version, especially while manipulating stream count

Fixes #23838
Fixes SCYLLADB-463

Closes scylladb/scylladb#28500

* github.com:scylladb/scylladb:
  alternator: add streams with tablets tests
  alternator: remove antitablet guards when using Streams
  alternator: implement streams for tablets
  treewide: add cdc helper functions to system_keyspace
  alternator: add system_keyspace reference
2026-04-17 23:48:31 +03:00
Radosław Cybulski
586bb1d345 alternator: fix issues with stream_arn copy / move
`stream_arn` object holds a full ARN as `std::string` and two
`std::string_view` fields (`table_name_` and `keyspace_name_`) pointing
into ARN itself. This prevents object from being safely copied
(as in that case both `table_name_` and `keyspace_name_` will point into
original object's ARN). Similar issue might happen with move, when
ARN contains string short enough for small string optimization to
kick in (although in practice this is not possible, as ARN has
requirements which make it's minimal length above 15 characteres -
current limit for small string optimizations in most popular string
libraries).

The patch drops `std::string_view` objects in favor of integer offsets
and sizes. The offset equal to 0 means beginning of ARN string. The api
is preserved - both `table_name` and `keyspace_name` function will
return `std::string_view` reconstructed on the fly.

Closes scylladb/scylladb#29507
2026-04-17 23:13:17 +03:00
Piotr Szymaniak
caaef45b7a audit: restore static_cast for batch inspect
Closes scylladb/scylladb#29545
2026-04-17 23:11:18 +03:00
Nikos Dragazis
d361a0dd83 test: cluster: Verify vnodes-to-tablets migration virtual task
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 21:13:52 +03:00
Nikos Dragazis
295e434781 distributed_loader: Link resharding tasks to migration virtual task
When a table is loaded on startup during a vnodes-to-tablets migration
(forward or rollback), the `table_populator` runs a resharding
compaction.

Set the migration virtual task as parent of the resharding task. This
enables users to easily find all node-local resharding tasks related to
a particular migration.

Make `migration_virtual_task::make_task_id()` public so that the
`distributed_loader` can compute the migration's task ID.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
a3aa4f6cb4 distributed_loader: Make table_populator aware of migration rollbacks
The `table_populator` uses a `migrate_to_tablets` flag to distinguish
normal tables from tables under vnodes-to-tablets migration (forward
path), since the two require different resharding.

The next patch will set the parent info of migration-related resharding
compaction tasks so they appear as children of the migration virtual
task. For that, the table populator needs to recognize not only
migrations in the forward path, but rollbacks as well.

Replace the flag with a tri-state `migration_direction` enum (none,
forward, rollback).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
696f9f8954 service: Add virtual task for vnodes-to-tablets migrations
Add a virtual task that exposes in-progress vnodes-to-tablets migrations
through the task manager API.

The task is synthesized from the current migration state, so completed
migrations are not shown. Progress is reported as the number of nodes
that currently use tablets: it increases on the forward path and
decreases on rollback. For simplicity, per-node storage modes are not
exposed in the task status; callers that need them should use the
migration status REST endpoint.

Unlike regular tasks that use time-based UUIDs, this task uses
deterministic named UUIDs derived from the keyspace names. This keeps
the implementation simple (no need to persist them) and gives each
keyspace a stable task ID. The downside is that the start time of each
task is unknown and repeated migrations of the same keyspace
(migration -> rollback -> new migration) cannot be distinguished.

Introduce a new task manager module to keep them separate from other
tasks.

Add support for `wait()`. While its practical value is debatable
(migration is a manual procedure, rolling restart will interrupt it), it
keeps the task consistent with the task manager interface.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
d1ca01b25d storage_service: Guard migration status against uninitialized group0
`storage_service::get_tablets_migration_status()` reads a group0 virtual
table, so it requires group0 to be initialized.

When invoked via the migration REST API, this condition is satisfied
since the API is only available after joining group0. However, once this
function is integrated into the task API later in this series, the
assumption will no longer hold, as the task API is exposed earlier in
the startup process.

Add a guard to detect this condition and return a clear error message.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
ca830c7bce compaction: Add parent_id to table_resharding_compaction_task_impl
Required to link it with the migration task in the next patches.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
46e3902daa storage_service: Add keyspace-level migration status function
`storage_service::get_tablets_migration_status()` returns the
keyspace-level migration status, indicating whether migration has not
started, is in progress, or has completed, and for migrating keyspaces
also returns per-node migration statuses. Rename it to
`get_tablets_migration_status_with_node_details()` and introduce a new
`get_tablets_migration_status()` that returns only the keyspace-level
status.

This prepares the function for reuse in the next patches, which will add
a virtual task for vnodes-to-tablets migrations. Several task-manager
paths will only need the keyspace-level migration state, not per-node
information.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
3096ba0577 storage_service: Replace migration status string with enum
Using a string was sufficient while this status was only exposed through
the REST API, but the next patches will also consume it internally.
Use an enum for the internal representation and convert it back to the
existing string values in the REST API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
a00056381f utils: Add UUID::is_name_based()
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:58:39 +03:00
Radosław Cybulski
9a6aed721b alternator: add streams with tablets tests
Add tests for Streams, when table uses tablets underneath.
One test verifies filtering using CHILD_SHARDS feature.
Other one makes sure we get read all data while the table
undergoes tablet count change.

Add `--tablet-load-stats-refresh-interval-in-seconds=1`
to `alternator/run` script, as otherwise newly added tests will fail.
The setting changes how often scylla refreshes tablet metadata.
This can't be done using `scylla_config_temporary`, as
1) default is 60 seconds
2) scylla will wait full timeout (60s) to read configuration variable again.
2026-04-17 18:58:27 +02:00
Radosław Cybulski
6be16cf224 alternator: remove antitablet guards when using Streams
Remove `if` condition, that prevented tables with tablets
working with Streams.
Remove a test, that verifies, that Alternator will reject
tables with tablets underneath working with Streams feature enabled
on them.
Update few tests, that were expected to fail on tablets to enable their
normal execution.
2026-04-17 18:58:26 +02:00
Radosław Cybulski
d5df3ec07c alternator: implement streams for tablets
Add a code, that will handle Streams reading, when table is
using tablets underneath.

Fixes #23838
2026-04-17 18:57:44 +02:00
Radosław Cybulski
eb35a7b6ce treewide: add cdc helper functions to system_keyspace
Add helper functions to `system_keyspace` object, that deal
with reading cdc content for tablet based table's.
`read_cdc_for_tablets_current_generation_timestamp` will read current
generation's timestamp.
`read_cdc_for_tablets_versioned_streams` will build
timestamp -> `cdc::streams_version` map similar to how
`system_distributed_keyspace::cdc_get_versioned_streams` works.
We're adding those helper functions, because their siblings in
`system_distributed_keyspace` work only, when base table is backed up
by vnodes. New additions work only, when base table is backed up
by tablets.
2026-04-17 18:57:44 +02:00
Radosław Cybulski
d93299b605 alternator: add system_keyspace reference
Add a reference to `system_keyspace` object to `executor` object in
alternator. The reference is needed, because in future commit
we will add there (and use) helper functions that read `cdc_log` tables
for tablet based tables similarly to already existing siblings
for vnodes living in `system_distributed_keyspace`.
2026-04-17 18:57:43 +02:00
Radosław Cybulski
04b9d3875f alternator: remove unnecesary code
After our fix, that prevents no-op changes being written into cdc log
we will remove Piotr Wieczorek's previous attempt, which is now
unnecesary.
2026-04-17 18:02:00 +02:00
Radosław Cybulski
6e5aaa85b6 alternator: fix Alternator writing unnecesary cdc entries
Work in this patch is a result of two bugs - spurious MODIFY event, when
remove column is used in `update_item` on non-existing item and
spurious events, when batch write item mixed noop operations with
operations involving actual changes (the former would still emit
cdc log entries).
The latter issue required rework of Piotr Wieczorek's algorithm,
which fixed former issue as well.

Piotr Wieczorek previously wrote checks, that should
prevent unnecesary cdc events from being written. His implementation
missed the fact, that a single `mutation` object passed to cdc code
to be analysed for cdc log entries can contain modifications for
multiple rows (with the same timestamp - for example as a result
to BatchWriteItem call). His code tries to skip whole `mutation`,
which in such case is not possible, because BatchWriteItem might have
one item that does nothing and second item that does modification
(this is the reason for the second bug).

His algorithm was extended and moved. Originally it was working
as follows - user would sent a `mutation` object with some changes to
be "augmented". The cdc would process those changes and built a set of
cdc log changes based on them, that would be added to cdc log table.
Piotr added a `should_skip` function, which processes user changes and
tried to determine if they all should be dropped or not.
New version, instead of trying to skip adding rows to
cdc log `mutation` object, builds a rows-to-ignore set.
After whole cdc log `mutation` object is completed, it processes it
and go through it row by row. Any row that was previously added to
a `rows_to_ignore` set will now be removed. Remaining rows are written to
new cdc log `mutation` with new clustering key
(`cdc$batch_seq_no` index value should probably be consecutive -
we just want to be safe here) and returns new `mutation` object to
be sent to cdc log table.

The first bug is fixed as a side effect of new algorithm,
which contains more precise checks detecting, if given
mutation actually made a difference.

Fixes: #28368
Fixes: SCYLLADB-538
Fixes: SCYLLADB-1528
Refs: #28452
2026-04-17 18:00:25 +02:00
Botond Dénes
6ce0968960 compaction: release GC'ed sstables incrementally during compaction
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.

This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.

Fixes #5563

Closes scylladb/scylladb#28984
2026-04-17 18:20:47 +03:00
Radosław Cybulski
2894542e57 alternator: add failing tests for Streams
Add failing tests for Streams functionality.
Trying to remove column from non-existing item is producing
a MODIFY event (while it should none).
Doing batch write with operations working on the same partition,
where one operation is without side effects and second with
will produce events for both operations, even though first changes nothing.

First test has two versions - with and without clustering key.
Second has only with clustering key, as we can't produce
batch write with two items for the same partition -
batch write can't use primary key more than once in single call.
We also add a test for batch write, where one of three operations
has no observable side effects and should not show up in Streams
output, but in current scylla's version it does show.
2026-04-17 16:28:14 +02:00
Botond Dénes
6eb2d15f39 Merge 'Replace CAS estimated histogram with estimated_histogram_with_max' from Amnon Heiman
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: estimated_histogram_with_max. It is both CPU- and
memory-efficient, and it can be exported as Prometheus native histograms.

Its main limitation (which also has benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration requires a few small API adjustments, so it is done in steps.

This PR replaces estimated_histogram for CAS contention.
The PR includes a patch that adds functionality to the base approx_exponential_histogram, which will be used by the API.

The specific histograms are defined in a single place and cover the range 1-100; this makes future changes easy.

**New feature, no need to backport**

Closes scylladb/scylladb#29017

* github.com:scylladb/scylladb:
  storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
  estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
2026-04-17 13:12:59 +03:00
Andrzej Jackowski
e256d9f69d test: retry get_coordinator_host() after topology coordinator stop
After stopping the topology coordinator, a new topology coordinator
may not yet be started when get_coordinator_host() is called.  Make
the function always retry via wait_for so that every caller is
protected against this race.

Fixes SCYLLADB-1553

Closes scylladb/scylladb#29489
2026-04-17 12:08:26 +02:00
Botond Dénes
fbcfe3f88f test: use uuid4 for DockerizedServer container names to avoid collisions
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.

Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.

Fixes: SCYLLADB-1540

Closes scylladb/scylladb#29509
2026-04-17 11:56:51 +02:00
Botond Dénes
57f8be49e9 Merge 'Move ignore_component_digest_mismatch flag on sstables_manager' from Pavel Emelyanov
The PR serves two purposes.

First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes.

Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one.

Code cleanup, not backporting

Closes scylladb/scylladb#29513

* github.com:scylladb/scylladb:
  sstables: Remove ignore_component_digest_mismatch from sstable_open_config
  sstables: Move ignore_component_digest_mismatch initialization to constructor
  sstables: Add ignore_component_digest_mismatch to sstables_manager config
2026-04-17 12:54:17 +03:00
Avi Kivity
cad3c0de94 test: write minio log to testlog dir for Jenkins artifact collection
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).

Closes scylladb/scylladb#29458
2026-04-17 12:51:55 +03:00
Botond Dénes
facb50cbf9 Merge 'test.py: refactor test.py' from Andrei Chekun
With the latest changes, there are a lot of code that is redundant in the test.py. This PR just cleans this code.
Also, it narrows using dynamic scope for fixtures to test/alternator and test/cqlpy. All the rest by default will have module scope.
test.py will be a wrapper for pytest mostly for CI use. As for now test.py have important part of calculating the number of threads to start pytest with. This is not possible to do in pytest itself.

No backport needed, framework enhancement only.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666

Closes scylladb/scylladb#28852

* github.com:scylladb/scylladb:
  test.py: remove testpy_test_fixture_scope
  test.py: add logger for 3rd party service
  test.py: delete dead code in test.py
2026-04-17 12:51:14 +03:00
Pawel Pery
7883f161bb vector-store: fix creating local vector search indexes with a part of the partition key
Users ought to have possibility to create the local index for Vector Search
based only on a part of the partition key. This commits provides this by
removing requirements of 'full partition key only' for custom local index.

The commit updates docs to explain that local vector index can use only a part
of the partition key.

The commit implements cqlpy test to check fixed functionality.

Fixes: SCYLLADB-953

Needs to be backported to 2026.1 as it is a fix for local vector indexes.

Closes scylladb/scylladb#28931
2026-04-17 11:44:15 +02:00
Karol Nowacki
c643f321af vector_search: decrease default connection timeout to 3s
Decrease the default connection timeout to 3s to better align with the
default CQL query timeout of 10s.

The previous timeout allowed only one failover request in high availability
scenario before hitting the CQL query timeout.
By decreasing the timeout to 3s, we can perform up to three failover requests
within the CQL query timeout, which significantly improves the chances of
successfully completing the query in high availability scenarios.

Fixes: SCYLLADB-95
2026-04-17 12:26:39 +03:00
Karol Nowacki
9269ca9cf7 vector_search: add unreachable node detection time config
Add option `vector_store_unreachable_node_detection_time_in_ms` to
control parameters related to detecting unreachable vector store nodes.
This parameter is used to set the TCP connect timeout, keepalive
parameters, and TCP_USER_TIMEOUT. By configuring these parameters,
we can detect unreachable vector store nodes faster and trigger
failover mechanisms in a timely manner.
2026-04-17 12:26:38 +03:00
Piotr Smaron
686029f52c audit: disable caching for the audit log table
The audit table had caching enabled by default, which provides no
value since audit data is write-heavy and rarely read back through
the cache. This wastes cache space that could be used for more
important user data.

Disable caching by setting keys and rows_per_partition to NONE and
enabled to false, consistent with get_disabled_caching_options()
and other system tables such as system.batchlog,
system.large_partitions, and CDC log tables.

Closes scylladb/scylladb#29506
2026-04-17 11:17:10 +02:00
Piotr Dulikowski
37fc1507f0 Merge 'Alternator: Add vector search support' from Nadav Har'El
This series adds support for vector search in Alternator based on the existing implementation in CQL.

The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search.

Correct functionality of these features depend on some features of the the vector store, that were already done (see https://github.com/scylladb/vector-store/pull/394).

This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have **not** done yet, and plan to do later in follow-up pull requests:

1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version.
2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition.
3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors.
4. Support oversampling and rescoring, defined per-index and per-query.
5. Support HNSW tuning parameters — maximum_node_connections, construction_beam_width, search_beam_width.
6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector's store's filter syntax). A decision still need to be made if this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`).
7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns).
8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time.
9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity).
10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries.
11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`).
12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling.
13. Support for "local index" (separate index for each partition).
14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets).

Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`.

 In total, this series includes 78 functional tests in 2200 lines of Python code.

This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit

Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whoppng 7,000 lines. These patches are just reorganization of code, no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end...

Closes scylladb/scylladb#29046

* github.com:scylladb/scylladb:
  test/alternator: add option to "run" script to run with vector search
  alternator: document vector search
  test/alternator: fix retries in new_dynamodb_session
  test/alternator: test for allowed characters in attribute names
  test/alternator: tests for vector index support
  alternator, vector: add validation of non-finite numbers in Query
  alternator: Query: improve error message when VectorSearch is missing
  alternator: add per-table metrics for vector query
  alternator: clean up duplicated code
  alternator: fix default Select of Query
  alternator: split executor.cc even more
  alternator: split alternator/executor.cc
  alternator: validate vector index attribute values on write
  alternator: DescribeTable for vector index: add IndexStatus and Backfilling
  alternator: implement Query with a vector index
  alternator: fix bug in describe_multi_item()
  alternator: prevent adding GSI conflicting with a vector index
  alternator: implement UpdateTable with a vector index
  alternator: implement DescribeTable with a vector index
  alternator: implement CreateTable with a vector index
  alternator: reject empty attribute names
  cdc: fix on_pre_create_column_families to create CDC log for vector search
2026-04-17 10:25:45 +02:00
Avi Kivity
04b54f363b Merge 'Enable vnodes-to-tablets migrations with arbitrary tokens' from Nikos Dragazis
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.

Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.

When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.

Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.

Fixes SCYLLADB-724.

New feature, no backport is needed.

Closes scylladb/scylladb#29319

* github.com:scylladb/scylladb:
  test: Use arbitrary tokens in vnodes->tablets migration tests
  test: boost: Add test for wrap-around vnodes
  storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
  storage_service: Hoist migration precondition
2026-04-17 00:46:35 +03:00
Andrei Chekun
745debe9ec test.py: remove testpy_test_fixture_scope
With migration to pyest this fixture is useless. Removing and setting
the session to the module for the most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope, while Test and TestSuite are not deleted. This is for
migration period, later on this function should be deleted.
2026-04-16 22:08:33 +02:00
Andrei Chekun
21addb2173 test.py: add logger for 3rd party service
With migration of preparation environment and starting 3rd party services
to the pytest, they're output the logs to the terminal. So this PR
binds them their own log file to avoid polluting the terminal.
2026-04-16 22:08:33 +02:00
Andrei Chekun
13770ab394 test.py: delete dead code in test.py
With the latest changes, there are a lot of code that is redundant in
the test.py. This PR just cleans this code.
Changes in other files are related to cleaning code from the test.py,
especially with redundant parameter --test-py-init and moving
prepare_environment to pytest itself.
2026-04-16 22:08:31 +02:00
Avi Kivity
999e108139 Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.

Fixes SCYLLADB-1542

This is a CI stability issue and should be backported.

Closes scylladb/scylladb#29504

* github.com:scylladb/scylladb:
  test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
  test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
  test: fix proc_utils.cc formatting from previous commit
  test: lib: use unique container name per retry attempt
  test: lib: fix broken retry in start_docker_service
2026-04-16 21:48:25 +03:00
Radosław Cybulski
c5ed6b22ae alternator: add CHILD_SHARDS filtering
Add a `CHILD_SHARDS` filter to `DescribeStream` command.
When used, user need to pass a parent stream shard id as
json's ShardFilter.ShardId field. DescribeStream will
then return only list of stream shards, that are direct
descendants of passed parent stream shard.

Each stream shard cover a consecutive part of token space.
A stream shard Q is considered to be a child of stream shard W,
when at least one token belongs to token spaces from both streams.
The filtering algorithm itself is somewhat complicated - more details
in comments in streams.cc.

CHILD_SHARDS is a Amazon's functionality and is required by KCL.

Add unit tests.

Fixes: #25160

Closes scylladb/scylladb#28189
2026-04-16 18:27:55 +03:00
Andrei Chekun
ba04e1e2c3 codeowners: add owner for the test framework
Add @xtrey as a codeowner of the test framework

Closes scylladb/scylladb#29518
2026-04-16 17:57:21 +03:00
Piotr Szymaniak
d0c3f78d76 test/alternator: extend local TTL streams timeout
Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility.

Fixes SCYLLADB-1556

Closes scylladb/scylladb#29516
2026-04-16 15:53:35 +03:00
copilot-swe-agent[bot]
ec7450bff8 topology_coordinator, tablets: Log active tablet transitions when going idle
This will make debugging of stalled tablet transitions easier. We saw
several issues when topology state machine was blocked by active
tablet migrations, which was not obvious at first glance of the
logs. Now it will be east to tell if tablet transitions are blocking
progress and which transitions are stuck.

Closes scylladb/scylladb#28616
2026-04-16 14:34:37 +03:00
Benny Halevy
05a00fe140 compaction_manager: fix use-after-free in postponed_compactions_reevaluation()
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).

Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().

While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.

Fixes: SCYLLADB-1463
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29443
2026-04-16 14:33:31 +03:00
Nadav Har'El
d3d5db37d7 test/alternator: add option to "run" script to run with vector search
Add to test/alternator/run the option "-vs" which runs alongside with
Scylla a vector store, to allow running Alternator tests with vector
indexing.

To get the vector store, do

	git clone git@github.com:scylladb/vector-store.git
	cargo build --release

"run -vs" looks for an executable in ../vector-store/target/*/vector-store
but can also be overridden by the VECTOR_STORE environment variable.

test/alternator/run runs the vector store exactly like it runs Scylla -
in a temporary directory, on a temporary IP address in the localhost
subnet (127.0.0/8), killing it when the test end, and showing the output
of both programs (Scylla and vector store). These transient runs of
Scylla and vector store are configured to be able to communicate to
each other.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:18 +03:00
Nadav Har'El
3d8463ccd2 alternator: document vector search
This patch adds a new document, docs/alternator/vector-search.md, on the
new vector search feature in Alternator. It introduces this feature, and
the DynamoDB APIs that we extended to support it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
164b0e37e1 test/alternator: fix retries in new_dynamodb_session
The new_dynamodb_session() function had a bug which we never noticed
because we hardly used it, but it became more noticable when the new
test/alternator/test_vector.py started to use it:

By default, boto3 retries a request up to 9 times when it encounters
a retriable error (such as an Internal Server Error). We don't want such
retries in our tests - it makes failures slower, but more importantly
it can hide "flaky" bugs by retrying 9 times until it happens to succeed.

The new_dynamodb_session() had code (copied from the dynamodb fixture)
to set boto3's "max_attempts" configuration to 0, to disable this retry.
But this code had an incorrect "if" to only be done if we're testing on
"localhost". This is wrong: We almost never use "localhost" as the
target of the test; Both test/cqlpy/run and test.py pick an IP address
in the localhost subnet (127/8) and uses that IP address - not the string
"localhost".

This bug only existed in new_dynamodb_session() - the more commonly used
"dynamodb" fixture didn't have this bug.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
858dee0b30 test/alternator: test for allowed characters in attribute names
One of the tests in the previous patch checked that strange characters
are allowed in attribute names used for vector indexing. It turns out
we never had a test that verifies that regardless of vector indexes -
any character whatsoever is allowed in attribute names. This is
different from table names which are much more limited.

So this patch adds the missing test.

As usual, the new test also passes on DynamoDB, showing that these
stange characters in attribute names are also allowed by DynamoDB.
2026-04-16 14:30:17 +03:00
Nadav Har'El
58538e18e8 test/alternator: tests for vector index support
In this patch we add a large collection of basic functional tests for the
vector index support, covering the CreateTable, UpdateTable,  DescribeTable
and Query operations and the various ways in which those are allowed to
work - or expected to fail. These tests were written in parallel with
writing the code so they (hopefully) cover all the corner cases considered
during development, and make sure these corner cases are all handled
correctly and will not regress in the future.

Some of these tests do not involve querying of the index and focus on
the structure of requests and the kind of syntax allowed. But other tests
are end-to-end, requiring the vector store to be running and trying to
index Alternator data and query it. These tests are marked
"needs_vector_store", and are immediately skipped in Scylla is not
configured to connect to a vector store. In a later patch we'll add a
an option to test/alternator/run to be able to run these end-to-end
tests by automatically running both Scylla and the Vector Store.

We'll have additional end-to-end tests in the vector-store repository.

Note that vector search is a new API feature that doesn't exist in DynamoDB,
so we are adding new parameters and outputs to existing operations. The AWS
SDKs don't normally allow doing that, so the test added here begins by
teaching the Python SDK to use the new APIs we added. This piece of code
can also be used by end-users to use vector search (at least in Python...)
before we officially add this support to ScyllaDB's SDK wrappers.
2026-04-16 14:30:17 +03:00
Nadav Har'El
fe5a5a813f alternator, vector: add validation of non-finite numbers in Query
Non-finite numbers (Inf, NaN) don't make sense in vector search, and
also not allowed in the DynamoDB API as numbers. But the parsing code
in Query's QueryVector accepted "Inf" and "NaN" and then failed to
send the request to the vector store, resulting in a strange error
message. Let's fix it in the parsing code.

We have a test (test_query_vectorsearch_queryvector_bad_number_string)
that verifies this fix.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
aa070fae5b alternator: Query: improve error message when VectorSearch is missing
Before this patch, if we attempt a Query with IndexName is a vector index
but forget a "VectorSearch" parameter, the error is misleading: The code
expects a GSI or LSI, and when it can't find a GSI or LSI with that name,
it reports that the index is missing. But this is not helpful. So in this
patch we produce a more helpful message: That the index does exist, and
is a vector index, so a "VectorSearch" parameter is mandatory and is
missing.
2026-04-16 14:30:16 +03:00
Nadav Har'El
f932f94422 alternator: add per-table metrics for vector query
The per-table metrics for Query were not incremented for the
vector variant of the Query operations, only the global metrics were
incremented. This patch fixes this oversight, and add a test that
reproduces it (the new test fails before this patch, and passes after).
2026-04-16 14:30:16 +03:00
Nadav Har'El
8cf510e06c alternator: clean up duplicated code
De-duplicate some code introduced in earlier patches, such a two
nearly-identical loops over the indexes (one to check if there is a
vector index, the second to get its dimensions), and two nearly-
identical chunks of code to get the item contents when there is or
there isn't a clustering key.

There should be no functional changes in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
f15c6634a7 alternator: fix default Select of Query
In earlier patches, when Query'ing a vector index, we set the default
Select to ALL_ATTRIBUTES. However, according to the DynamoDB documentation
for Query,

   "If neither Select nor ProjectionExpression are specified, DynamoDB
    defaults to ALL_ATTRIBUTES when accessing a table, and
    ALL_PROJECTED_ATTRIBUTES when accessing an index."

This default should also apply to vector index, so this patch fixes this.
The new behavior is not only more compatible with DynamoDB, it is also
much more efficient by default, as ALL_PROJECTED_ATTRIBUTES does not need
to read from the base table - it returns the results that the vector store
returned. Of course, if the user needs the more efficient ALL_ATTRIBUTES
this option is still available - it's just no longer the default.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
2e274bbdba alternator: split executor.cc even more
This patch continues the effort to split the huge executor.cc (5000
lines before this patch) even more.

In this patch we introduce a new source file, executor_util.cc, for
various utility functions that are used for many different operations
and therefore are useful to have in a header file. These utility
functions will now be in executor_util.cc and executor_util.hh -
instead of executor.cc and executor.hh.

Various source files, including executor.cc, the executor_read.cc
introduced in the previous patch, as well as older source files like
as streams.cc, ttl.cc and serialization.cc, use the new header file.

This patch removes over 700 lines of code from executor.cc, and
also removes a large amount of utility functions declerations from
executor.hh. Originally, executor.hh was meant to be about the
interface that the Alternator server needs to *execute* the different
DynamoDB API operations - and after this patch it returns closer to
this original goal.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
751da00692 alternator: split alternator/executor.cc
Already six years ago, in #5783, we noticed that alternator/executor.cc
has grown too large. The previous patches added hundreds of more lines
to it to implement vector search, and it reached a whopping 7,000 lines
of code. This is too much.

This patch splits from executor.cc two major chunks:

1. The implementation of **read** requests - GetItem, BatchGetItem,
   Query (base table, GSI/LSI, and vector-search), and Scan - was
   moved to a new source file alternator/executor_read.cc.
   The new file has 2,000 lines.

2. Moved 250 lines of template functions dealing with attribute paths
   and maps of them to a new header file, attribute_path.hh.
   These utilities are used for many different operations - various
   read operations use them for ProjectionExpression, and UpdateItem
   uses them for modifications to nested attributes, so we need the
   new header file from both executor.cc and executor_read.cc

The remaining executor.cc is still pretty big, 5,000 lines, and
contains write operations (PutItem, UpdateItem, DeleteItem,
BatchWriteItem) as well as various table and other operations, and
also many utility functions used by many types of operations, so
we can later continue this refactoring effort.

Refs #5783

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:10 +03:00
Emil Maskovsky
91df3795fc encryption: cover system.raft table in system_info_encryption
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.

During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.

Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.

Fixes: CUSTOMER-268

Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.

Closes scylladb/scylladb#29242
2026-04-16 13:22:10 +02:00
Botond Dénes
d006c4c476 Merge 'Untie (partially) cql3/statements from db::config' from Pavel Emelyanov
There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get db::config reference. This PR moves most of these accessed options onto cql_config

Options migrated to cql_config:

   1. select_internal_page_size
   2. strict_allow_filtering
   3. enable_parallelized_aggregation
   4. batch_size_warn_threshold_in_kb
   5. batch_size_fail_threshold_in_kb
   6. 7 keyspace replication restriction options
   7. 2 TWCS restriction options
   8. restrict_future_timestamp
   9. strict_is_not_null_in_views (with view_restrictions struct)
   10. enable_create_table_with_compact_storage

Some options need special treatment and are still abused via database, namely:

  1. enable_logstor
  2. cluster_name
  3. partitioner
  4. endpoint_snitch

Fixing components inter-dependencies, not backporting

Closes scylladb/scylladb#29424

* github.com:scylladb/scylladb:
  cql3: Move enable_create_table_with_compact_storage to cql_config
  cql3: Move strict_is_not_null_in_views to cql_config
  cql3: Move restrict_future_timestamp to cql_config
  cql3: Move TWCS restriction options to cql_config
  cql3: Move keyspace restriction options to cql_config
  cql3: Move batch_size_fail_threshold_in_kb to cql_config
  cql3: Move batch_size_warn_threshold_in_kb to cql_config
  cql3: Move enable_parallelized_aggregation to cql_config
  cql3: Move strict_allow_filtering to cql_config
  cql3: Move select_internal_page_size to cql_config
  test: Fix cql_test_env to use updateable cql_config from db::config
  cql3: Add cql_config parameter to parsed_statement::prepare()
2026-04-16 14:04:43 +03:00
Botond Dénes
88a8324e68 erge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.

This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.

When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.

1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276

* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport

Closes scylladb/scylladb#29257

* github.com:scylladb/scylladb:
  db: implement large_data virtual tables with feature flag gating
  db: call initialize_virtual_tables from shard 0 only
  test: add LargeDataRecords round-trip unit tests
  sstables: populate LargeDataRecords from writer
  large_data_handler: return above_threshold_result from maybe_record_large_cells
  large_data_handler: rename partition_above_threshold to above_threshold_result
  sstables: add LargeDataRecords metadata type (tag 13)
  sstables: add fmt::formatter for large_data_type
  keys: move key_to_str() to keys/keys.hh
2026-04-16 14:03:31 +03:00
Pavel Emelyanov
4d352c7cf5 sstables: Remove ignore_component_digest_mismatch from sstable_open_config
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:49:14 +03:00
Pavel Emelyanov
9107e055b3 sstables: Move ignore_component_digest_mismatch initialization to constructor
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.

Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:49:00 +03:00
Pavel Emelyanov
8abfd9af00 sstables: Add ignore_component_digest_mismatch to sstables_manager config
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:48:49 +03:00
Nadav Har'El
83670d2493 alternator: validate vector index attribute values on write
When a table has a vector index, writes to the indexed attribute
(via PutItem, UpdateItem, or BatchWriteItem) must supply a value that
is a vector of the appropriate length: It must be a list of exactly the
declared number of elements, where each element is a numeric type ("N")
representable as a 32-bit float. Before this patch, invalid values were
silently accepted and the item was simply not indexed (it was skipped
by the vector store when it read this item). Now these writes are
rejected with a ValidationException.

This is analogous to the existing validation of GSI/LSI key attribute
values - in DynamoDB after a certain attribute becomes the key of a
GSI or LSI, the user is no longer allowed to write the same type.

The implementation we add here is also analogous to the implementation
of the GSI/LSI key validation. The GSI/LSI key validation is done
by validate_value_if_index_key / si_key_attributes, and in this
patch we add the vector-index parallels: vector_index_attributes()
collects the attribute name and declared dimensions for every vector
index in the schema, and validate_value_if_vector_index_attribute()
enforces the type limitations.

For efficiency in the common case where a table has no vector indexes
and no GSIs/LSIs, both validation functions are out-of-line and each
call site guards the call with an explicit empty() check, so no
function-call overhead is incurred when there is nothing to validate.
For UpdateItem, the map of vector index attributes is cached in
update_item_operation (alongside the existing _key_attributes cache)
to avoid recomputing it on every call to update_attribute().
2026-04-16 13:31:49 +03:00
Nadav Har'El
aea7b6a66b alternator: DescribeTable for vector index: add IndexStatus and Backfilling
Add to DescribeTable's output for VectorIndexes two fields - IndexStatus
and Backfilling - which are intended to exactly mirror these two fields
that exist for GlobalSecondaryIndexes:

When a vector index is added, IndexStatus is "CREATING" before the index
is usable, and "ACTIVE" when it is finally usable for a Query. During
"CREATING" phase, "Backfilling" may be set to true when the index is
currently being backfilled (the table is scaned and an index is built).

A user is expected to call DescribeTable in a loop after creating a
vector index (via either CreateTable and UpdateTable) and only call
Query on the index after the IndexStatus is finally ACTIVE. Calling
Query earlier, while IndexStatus is still CREATING, will result in an
error.

In the current implementation, Alternator does not track the state of the
vector index, so it needs to contact the vector store to inquire about
the state of the index - using a new function introduced in this patch
that uses an existing vector-store API. This makes DescribeTable slower
on tables that have vector indexes, because the vector store is contacted
on every DescribeTable call.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:31:49 +03:00
Nadav Har'El
e43a2e5086 alternator: implement Query with a vector index
We introduce to the Query request a new "VectorSearch" parameter, which
take a mandatory "QueryVector" (a value which must be a numeric vector
of the right length) and "Limit".

The "Limit" of a vector search (Query with VectorSearch) determines the
number of nearest neighbors to return, and does not allow pagination
(ExclusiveKeyStart is not allowed). ConsistentRead=True is also not
allowed on a vector search query.

The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are
also supported, requesting which attributes to fetch. Using Select=
ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the
vector index - currently only the key columns - so it is significantly
faster than ALL_ATTRIBUTES because it doesn't require reading the items
from the base table.

The "FilterExpression" parameter is also supported. Like in DynamoDB's
traditional Query, this does post-filtering, i.e., removing some of the
results returned by the vector index that don't match the filter, and
as a result fewer than Limit results may be returned.

Pre-filtering (done on the vector store, and always returns Limit
results) is not yet implemented.
2026-04-16 13:31:47 +03:00
Nadav Har'El
68e34c57e1 alternator: fix bug in describe_multi_item()
In commit a55c5e9ec7, the function
describe_multi_item() got a new item_callback parameter, that can
be used to calculate the size of the item. This new parameter
has a default, an empty noncopyable_function. But an empty
noncopyable_function shouldn't be called - exactly like std::function,
it throws std::bad_function_call if called when empty.

So describe_multi_item() should only call this item_callback if
it's not empty.

This became a problem in the next patch, implementing vector search
query, which called describe_multi_item with the default item_callback.
But in general, the function should be usable with the default parameter
(or we shouldn't have defined a default value for this parameter!).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:30:02 +03:00
Nadav Har'El
ffe1029b7c alternator: prevent adding GSI conflicting with a vector index
All the "indexes" we implement in Alternator - GSI, LSI and the new
vector index - share the same IndexName namespace, which we'll use in
Query to refer to the index. In the previous patch we already prevented
adding a vector index with the same name as an existing GSI or LSI.
In this patch we also prevent the reverse - adding a GSI with the name
of an existing vector index.

Additionally, one cannot add a GSI on a key that is already the key of
a vector index: The types conflict: The key of a vector index must be a
vector column, while the key of a GSI must have a standard key type
(string, binary or number).

We have tests for this later, this the big test patch.
2026-04-16 13:30:02 +03:00
Nadav Har'El
82de16f92c alternator: implement UpdateTable with a vector index
After an earlier patch allowed CreateTable to create vector indexes
together with a table, in this patch we add to UpdateTable the ability
to add a new vector index to an existing table, as well as the ability
to delete a vector index from an existing table.

The implementation is inspired by DynamoDB's syntax for GSI - just like
GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations,
for vector indexes we have VectorIndexUpdates supporting Create and
Delete. "Update" is not yet supported - we didn't implement yet any
parameter that can be updated - but we can easily implement it in the
future.
2026-04-16 13:30:02 +03:00
Nadav Har'El
217090a996 alternator: implement DescribeTable with a vector index
In this patch we add to DescribeTable the ability to list the vector indexes
enabled on an Alternator table.
2026-04-16 13:30:02 +03:00
Nadav Har'El
e156d67177 alternator: implement CreateTable with a vector index
ScyllaDB supports the "vector search" feature in CQL.
In this patch we start the path to adding vector search support also to
Alternator.

In this patch, we implement CreateTable support - allowing the user to enable
vector search in a new table. The following patches will enable additional
operations like UpdateTable (adding a vector index to an existing table or
deleting a vector index to an existing table) and DescribeTable.

Extensive tests for all these features will come at the end of the series.
Those tests were written in parallel with writing this implementation so cover
(hopefully) every nook and cranny of the imlementation.
2026-04-16 13:29:58 +03:00
Nadav Har'El
0afc730b7b alternator: reject empty attribute names
Alternator has a function validate_attr_name_length() used to validate an
attribute name passed in different operations like PutItem, UpdateItem,
GetItem, etc. It fails the request if the attribute name is longer than
65535 characters.

It turns out that we forgot to check if the attribute name length isn’t 0 -
which should be forbidden as well!

This patch fixes the validation code, and also adds a test that confirms
that after this patch empty attribute names are rejected - just like DynamoDB
does - whereas before this patch they were silently accepted.

We want to fix this issue now, because in a later patch we intend to use
the same validation function also for vector indexes - and want it to be
accurate.

Fixes SCYLLADB-1069.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:28:15 +03:00
Nadav Har'El
8948a50f3b cdc: fix on_pre_create_column_families to create CDC log for vector search
The vector-search feature, which is already supported in CQL, introduced
the somewhat confusing feature of enabling CDC without explicitly enabling
CDC: When a vector index is enabled on a table, CDC is "enabled" for it
even if the user didn't ask to enable CDC.

For this, some code in cdc/log.cc began to use cdc_enabled() instead of
checking schema.cdc_options.enabled() directly. This cdc_enabled()
function checks if either this enabled() is true, or has_vector_index()
is true.

But there's another twist to this story: To write with CDC, we also need
to create the CDC log table:

1. In CQL, a vector index can only be added on an existing table (with
   CREATE INDEX), so the hook on_before_update_column_family() is the
   one that noticed that a vector index was added, and created the CDC
   log table.

2. But in Alternator, a vector index can be created up-front with a
   brand-new table (in CreateTable), so the hook for a new table -
   on_pre_create_column_families(), also needs to create the CDC log
   table. It already did, but incorrectly checked just the explicit
   CDC-enabled flag instead of the new cdc_enabled() function that
   also allows vector index.

So this patch just fixes on_pre_create_column_families to use cdc_enabled().

Before this patch, when a vector index will be created in Alternator with
CreateTable, an attempt to write to the table (PutItem) will fail
because it will try to write to the CDC log, which wasn't created.
After this patch, it works. The reproducing test is
test_putitem_vectorindex_createtable (introduced in a later patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:28:15 +03:00
Roy Dahan
d2d7604188 ci: pin GitHub Actions to commit SHAs and migrate to Node.js 24
Pin all external GitHub Actions to full commit SHAs and upgrade to
their latest major versions to reduce supply chain attack surface:

- actions/checkout: v3/v4/v5 -> v6.0.2
- actions/github-script: v7 -> v8.0.0
- actions/setup-python: v5 -> v6.2.0
- actions/upload-artifact: v4 -> v7.0.0
- astral-sh/setup-uv: v6 -> v8.0.0
- mheap/github-action-required-labels: v5.5.2 (pinned)
- redhat-plumbers-in-action/differential-shellcheck: v5.5.6 (pinned)
- codespell-project/actions-codespell: v2.2 (pinned, was @master)

Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true in all 21 workflows that
use JavaScript-based actions to opt into the Node.js 24 runtime now.
This resolves the deprecation warning:

  "Node.js 20 actions are deprecated. Please check if updated versions
   of these actions are available that support Node.js 24. Actions will
   be forced to run with Node.js 24 by default starting June 2nd,
   2026. Node.js 20 will be removed from the runner on September 16th,
   2026."

See: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/

scylladb/github-automation references are intentionally left at @main
as they are org-internal reusable workflows.

Fixes: SCYLLADB-1410

Backport: Backport is required for live branches that run GH actions:
2026.1, 2025.4, 2025.1 and 2024.1

Closes scylladb/scylladb#29421
2026-04-16 13:03:33 +03:00
Pavel Emelyanov
207d3b4a68 test_backup: Remove create_schema() helper Test
Remove the create_schema() helper function and inline its logic directly into the four call sites. This simplifies the code by eliminating a trivial wrapper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29406
2026-04-16 12:57:26 +03:00
Botond Dénes
830d28a889 Merge 'Use standard helpers to create ks:cf and populate it in test_backup.py' from Pavel Emelyanov
The PR removed the create_and_ks() helper from backup test and patches all callers to create keyspace, table and populate them with standard explicit facilities. While patching it turned out that one test doesn't need to populate the table, so it even becomes tiny bit shorter and faster

Enhancing test, not backporting

Closes scylladb/scylladb#29417

* github.com:scylladb/scylladb:
  test_backup: Remove create_ks_and_cf helper Test
  test_backup: Replace create_ks_and_cf with async patterns Test
  test_backup: Add if-True blocks for indentation Test
2026-04-16 12:54:21 +03:00
Nikos Dragazis
7abcf94823 test: Use arbitrary tokens in vnodes->tablets migration tests
The migration tests used to start nodes with pre-computed power-of-two
tokens. This was required because the migration itself only supported
power-of-two aligned tokens. Now that arbitrary tokens are supported,
switch the tests to use Scylla's default random token assignment.

Switching to arbitrary tokens makes the tests non-deterministic, but the
migration aspects that are affected by the token distribution
(resharding, wrap-around vnode split) are out of scope for these tests
and covered by dedicated tests.

Add a `get_all_vnode_tokens()` helper that queries system.topology at
runtime to discover the actual token layout, and derive expected tablet
counts from that.

Also account for the possible extra wrap-around tablet when the last
vnode token does not coincide with MAX_TOKEN.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:47:27 +03:00
Nikos Dragazis
26f0c038af test: boost: Add test for wrap-around vnodes
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.

Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.

The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:47:16 +03:00
Botond Dénes
c355df4461 Merge 'test: Lower default log level from DEBUG to INFO' from Artsiom Mishuta
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments

+minor fix

[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown)in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.

Closes scylladb/scylladb#29086

* github.com:scylladb/scylladb:
  test/pylib: save logs on success only during teardown phase
  test: Lower default  log level from DEBUG to INFO
2026-04-16 12:46:11 +03:00
Nikos Dragazis
098732ff76 storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
The vnodes-to-tablets migration creates tablet maps that mirror the
vnode layout: one tablet per vnode, preserving token boundaries and
replica placement. However, due to tablet restrictions, the migration
requires vnode tokens to be a power of two and uniformly distributed
across the token ring.

In practice, this restriction is too limiting. Real clusters use
randomly generated tokens and a node's token assignment is immutable.

To solve this problem, prior work (01fb97ee78) has been done to relax
the tablet constraints by allowing arbitrary tablet boundaries, removing
the requirement for power-of-two sizing and uniform distribution.

This patch leverages the relaxed tablet constraints to enable tablet map
creation from arbitrary vnode tokens:

* Removes all token-related constraints.
* Handles wrap-around vnodes. If a vnode wraps (i.e., the highest vnode
  token is not `dht::token::last()`), it is split into two tablets:
  - (last_vnode_token, dht::token::last()]
  - [dht::token::first(), first_vnode_token]

The migration ops guide has been updated to remove the power-of-two
constraint.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:39:23 +03:00
Nikos Dragazis
8ea8c05120 storage_service: Hoist migration precondition
`prepare_for_tablets_migration()` is idempotent; it filters out tables
that already have tablet maps and returns early if no tablet maps need
to be created. However, this precondition is currently misplaced. Move
it higher to skip extra work.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:19:34 +03:00
Botond Dénes
9bfcc25cf7 Merge 'streaming: stream_blob: hold table for streaming' from Michael Litvak
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.

For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.

For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discard the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.

Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533)

It's a pre-existing use-after-free issue in sstable file streaming so should be backported to all releases.
It's also made worse with the recent changes of logstor, and affects also non-logstor tables, so the logstor fixes should be in the same release (2026.2).

[SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29488

* github.com:scylladb/scylladb:
  test: test drop table during streaming
  streaming: stream_blob: hold table for streaming
2026-04-16 12:12:42 +03:00
Dario Mirovic
50e498ac0d test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
Fix assorted typos in comments, strings, and identifiers:
- path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc)
- laúnch -> launch (proc_utils.cc)
- hand/fail -> hang/fail (dockerized_service.py)
- inconvinient -> inconvenient (dockerized_service.py)
- priviledges -> privileges (gcs_fixture.hh)
- remove double semicolon (gcs_fixture.cc)

Refs SCYLLADB-1542
2026-04-16 10:58:55 +02:00
Dario Mirovic
11b5997eaf test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).

Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.

Refs SCYLLADB-1542
2026-04-16 10:58:52 +02:00
Dario Mirovic
dc7f848bf8 test: fix proc_utils.cc formatting from previous commit
Fix indentation of lines moved inside the for-loop in
start_docker_service (lines 208-225).

Refs SCYLLADB-1542
2026-04-16 10:55:48 +02:00
Dario Mirovic
be4d32c474 test: lib: use unique container name per retry attempt
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".

Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.

Refs SCYLLADB-1542
2026-04-16 10:55:04 +02:00
Botond Dénes
33682fd14e Merge 'sstables/storage_manager: fix race between object storage config update and keyspace creation' from Dimitrios Symonidis
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.

Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine  into two phases:

- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.

Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has been also introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)

Closes scylladb/scylladb#28950

* github.com:scylladb/scylladb:
  test/object_store: verify object storage client creation and live reconfiguration
  sstables/utils/s3: split config update into sync and async parts
  test_config: improve logging for wait_for_config API
  db: introduce read-write lock to synchronize config updates with REST API
2026-04-16 10:20:43 +03:00
Michael Litvak
43c76aaf2b logstor: split log record to header and data
Split the `log_record` to `log_record_header` type that has the record
metadata fields and the mutation as a separate field which is the actual
record data:

struct log_record {
    log_record_header header;
    canonical_mutation mut;
};

Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.

This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
  headers.
* on compaction, we read the record header first to determine if the
  record is alive, if yes then we read the data.

Closes scylladb/scylladb#29457
2026-04-16 10:00:35 +03:00
Botond Dénes
8e7ba7efe2 Merge 'commitlog: fix segment replay order by using ordered map per shard' from Sergey Zolotukhin
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.

Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
  segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
  feature, which relies on replayed raft items being stored in order.

Fix by changing the data structure from
  std::unordered_multimap<unsigned, commitlog::descriptor>
to
  std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>

Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.

Fixes: SCYLLADB-1411

Backport: It looks like this issue doesn't cause any trouble, and is required only by the strong consistent tables, so no backporting required.

Closes scylladb/scylladb#29372

* github.com:scylladb/scylladb:
  commitlog: add test to verify segment replay order
  commitlog: fix replay order by using ordered map per shard
2026-04-16 09:55:27 +03:00
Pavel Emelyanov
335261f351 cql3: Move enable_create_table_with_compact_storage to cql_config
Move enable_create_table_with_compact_storage option from db::config to
cql_config. This improves separation of concerns by consolidating CQL-specific
table creation policies in the cql_config structure. Update the CREATE TABLE
statement prepare() function to use the new location for the configuration check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:20 +03:00
Pavel Emelyanov
f20ede79f9 cql3: Move strict_is_not_null_in_views to cql_config
Move strict_is_not_null_in_views option from db::config to cql_config via
new view_restrictions sub-struct. This improves separation of concerns by
keeping view-specific validation policies with other CQL configuration.
Update prepare_view() to take view_restrictions reference instead of reaching
into db::config, and update all callsites to pass the sub-struct.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:19 +03:00
Pavel Emelyanov
027c91f45e cql3: Move restrict_future_timestamp to cql_config
Move restrict_future_timestamp option from db::config to cql_config. This
improves separation of concerns as timestamp validation is part of CQL query
execution behavior. Update validate_timestamp() function signature to take
cql_config reference instead of db::config, and update all callsites in
modification_statement and batch_statement to pass cql_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:53 +03:00
Pavel Emelyanov
7264581881 cql3: Move TWCS restriction options to cql_config
Move twcs_max_window_count and restrict_twcs_without_default_ttl options
from db::config to cql_config via new twcs_restrictions sub-struct. This
improves separation of concerns by keeping TWCS-specific validation policies
with other CQL configuration. Update check_restricted_table_properties()
to remove unused db parameter and take twcs_restrictions reference instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:52 +03:00
Pavel Emelyanov
8b853505cd cql3: Move keyspace restriction options to cql_config
Introduce replication_restrictions, a sub-struct of cql_config, to hold
the seven keyspace-level policy options that govern how CREATE/ALTER
KEYSPACE statements are validated:
  - restrict_replication_simplestrategy
  - replication_strategy_warn_list / replication_strategy_fail_list
  - minimum/maximum_replication_factor_warn/fail_threshold

Pass replication_restrictions into check_against_restricted_replication_strategies()
instead of having it reach into db::config directly (via both
qp.db().get_config() and qp.proxy().data_dictionary().get_config()).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:24 +03:00
Benny Halevy
ce00d61917 db: implement large_data virtual tables with feature flag gating
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).

The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:

- Before the feature is enabled: the old physical tables remain in
  all_tables(), CQL writes are active, no virtual tables are registered.
  This ensures safe rollback during rolling upgrades.

- After the feature is enabled: old physical tables are dropped from
  disk via legacy_drop_table_on_all_shards(), virtual tables are
  registered on all shards, and CQL writes are skipped via
  skip_cql_writes() in cql_table_large_data_handler.

Key implementation details:

- Three virtual table classes (large_partitions_virtual_table,
  large_rows_virtual_table, large_cells_virtual_table) extend
  streaming_virtual_table with cross-shard record collection.

- generate_legacy_id() gains a version parameter; virtual tables
  use version 1 to get different UUIDs than the old physical tables.

- compaction_time is derived from SSTable generation UUID at display
  time via UUID_gen::unix_timestamp().

- Legacy SSTables without LargeDataRecords emit synthetic summary
  rows based on above_threshold > 0 in LargeDataStats.

- The activation logic uses two paths: when the feature is already
  enabled (test env, restart), it runs as a coroutine; when not yet
  enabled, it registers a when_enabled callback that runs inside
  seastar::async from feature_service::enable().

- sstable_3_x_test updated to use a simplified large_data_test_handler
  and validate LargeDataRecords in SSTable metadata directly.
2026-04-16 08:49:02 +03:00
Benny Halevy
cb6004b625 db: call initialize_virtual_tables from shard 0 only
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.

This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
2026-04-16 08:49:02 +03:00
Benny Halevy
90d4ff34fb test: add LargeDataRecords round-trip unit tests
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():

- test_large_data_records_round_trip: verifies partition_size, row_size,
  and cell_size records are written with correct field semantics when
  thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
  keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
  are written when data is below all thresholds

Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
2026-04-16 08:49:02 +03:00
Benny Halevy
1f7faeef57 sstables: populate LargeDataRecords from writer
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records.  On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.

Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count

A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
2026-04-16 08:49:02 +03:00
Benny Halevy
8f4976f65d large_data_handler: return above_threshold_result from maybe_record_large_cells
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds.  This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
2026-04-16 08:49:02 +03:00
Benny Halevy
c1b797f288 large_data_handler: rename partition_above_threshold to above_threshold_result
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).

Use designated initializers for clarity.
2026-04-16 08:49:02 +03:00
Benny Halevy
d92cd42fe6 sstables: add LargeDataRecords metadata type (tag 13)
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records.  Each record carries:
  - large_data_type (partition_size, row_size, cell_size, etc.)
  - binary serialized partition key and clustering key
  - column name (for cell records)
  - value (size in bytes)
  - element count (rows or collection elements, type-dependent)
  - range tombstones and dead rows (partition records only)

The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.

Add JSON support in scylla-sstable and format documentation.
2026-04-16 08:49:01 +03:00
Benny Halevy
85e2c6f2a7 sstables: add fmt::formatter for large_data_type
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
2026-04-16 08:42:54 +03:00
Benny Halevy
d4283d0ffc keys: move key_to_str() to keys/keys.hh
Move the key_to_str() template function from a file-local static in
db/large_data_handler.cc to keys/keys.hh so it can be reused by:
- large_data_handler.cc for log messages
- virtual tables (db/virtual_tables.cc) for converting binary keys
  to human-readable CQL display
- scylla-sstable for JSON output of LargeDataRecords

No functional change.
2026-04-16 08:42:54 +03:00
Pavel Emelyanov
1af26a1dd6 cql3: Move batch_size_fail_threshold_in_kb to cql_config
The batch_size_fail_threshold_in_kb option controls the batch size at
which an oversized batch error is returned to the client. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4d255cf533 cql3: Move batch_size_warn_threshold_in_kb to cql_config
The batch_size_warn_threshold_in_kb option controls the batch size at
which a client warning is emitted during batch execution. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
a3f097f100 cql3: Move enable_parallelized_aggregation to cql_config
The enable_parallelized_aggregation option controls whether aggregation
queries are fanned out across shards for parallel execution. It belongs
in cql_config rather than db::config as it directly governs CQL query
behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4314fc0642 cql3: Move strict_allow_filtering to cql_config
The strict_allow_filtering option controls whether queries that require
ALLOW FILTERING are silently accepted, warned about, or rejected. It
belongs in cql_config rather than db::config as it directly governs CQL
query behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
3411ed8bcc cql3: Move select_internal_page_size to cql_config
The select_internal_page_size option controls CQL query execution
behavior (internal paging for aggregate/filtered SELECTs) and belongs
in cql_config rather than being read directly from db::config at
execution time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
728eb20b42 test: Fix cql_test_env to use updateable cql_config from db::config
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).

Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
60a834d9fa cql3: Add cql_config parameter to parsed_statement::prepare()
Pass cql_config to prepare() so that statement preparation can use
CQL-specific configuration rather than reaching into db::config
directly.

Callers that use default_cql_config:
- db/view/view.cc: builds a SELECT statement internally to compute view
  restrictions, not in response to a user query
- cql3/statements/create_view_statement.cc: same -- parses the view's
  WHERE clause as a synthetic SELECT to extract restrictions
- tools/schema_loader.cc: offline schema loading tool, no runtime
  config available
- tools/scylla-sstable.cc: offline sstable inspection tool, no runtime
  config available

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:25 +03:00
Nadav Har'El
f0e9177130 Merge 'audit/alternator: Make Alternator requests audited' from Piotr Szymaniak
Each Alternator API call results in the request being audited, provided the auditing is enabled.
Both successful as well as the failed requests are audited, with few exceptions.

The chosen audit types for the operations:
- CreateTable - DDL
- DescribeTable - QUERY
- DeleteTable - DDL
- UpdateTable - DDL
- PutItem - DML
- UpdateItem - DML
- GetItem - QUERY
- DeleteItem - DML
- ListTables - QUERY
- Scan - QUERY
- DescribeEndpoints - QUERY
- BatchWriteItem - DML
- BatchGetItem - QUERY
- Query - QUERY
- TagResource - DDL
- UntagResource - DDL
- ListTagsOfResource - QUERY
- UpdateTimeToLive - DDL
- DescribeTimeToLive - QUERY
- ListStreams - QUERY
- DescribeStream - QUERY
- GetShardIterator - QUERY
- GetRecords - QUERY
- DescribeContinuousBackups - QUERY

FIXME: The tests are now covering the new functionality only partially.

Fixes: scylladb/scylla-enterprise#3796
Fixes: SCYLLADB-467

No need to backport, new functionality.

Closes scylladb/scylladb#27953

* github.com:scylladb/scylladb:
  audit/alternator: support audit_tables=alternator.<table> shorthand
  audit/alternator: Add negative audit tests
  audit/alternator: Add testing of auditing
  audit/alternator: Audit requests
  audit/alternator: Refactor in preparation for auditing Alternator
2026-04-15 22:17:57 +03:00
Nikos Dragazis
d38f44208a test/cqlpy: Harden mutation_fragments tests against background flushes
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.

Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.

This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.

This patch fixes the affected tests in two ways:

1. Where possible, tests are reworked to not assume a single SSTable:
   - test_paging
   - test_slicing_rows
   - test_many_partition_scan

2. Where rework is impractical, major compaction is added after writes
   and before validation to ensure that only one SSTable will exist:
   - test_smoke
   - test_count
   - test_metadata_and_value
   - test_slicing_range_tombstone_changes

Fixes SCYLLADB-1375.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#29389
2026-04-15 21:46:00 +03:00
Michael Litvak
cc94467097 test: test drop table during streaming
Add a test that drops a table while tablet streaming is running for the
table. The table is dropped after taking the storage snapshot and
initializating streaming sources - after that streaming should be able
to complete or abort correctly if the table is dropped. We want to
verify there is no incorrect access to the destroyed table.

The test tests both types of streaming in stream_blob - sstables
and logstor segments.
2026-04-15 19:23:00 +02:00
Michael Litvak
69d2a90106 streaming: stream_blob: hold table for streaming
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.

For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.

For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discard the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.

Fixes SCYLLADB-1533
2026-04-15 19:22:42 +02:00
Avi Kivity
59ec93b86b Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.

Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.

Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".

build/release/scylla perf-tablets:

Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)

Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full      407.69ms
partial     2.65ms
```

After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full      354.50ms
partial     2.19ms
```

Closes scylladb/scylladb#28459

* github.com:scylladb/scylladb:
  test: boost: tablets: Add test for merge with arbitrary tablet count
  tablets, database: Advertise 'arbitrary' layout in snapshot manifest
  tablets: Introduce pow2_count per-table tablet option
  tablets: Prepare for non-power-of-two tablet count
  tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
  tablets: Prepare resize_decision to hold data in decisions
  tablets: table: Make storage_group handle arbitrary merge boundaries
  tablets: Make stats update post-merge work with arbitrary merge boundaries
  locator: tablets: Support arbitrary tablet boundaries
  locator: tablets: Introduce tablet_map::get_split_token()
  dht: Introduce get_uniform_tokens()
2026-04-15 18:57:22 +03:00
Andrzej Jackowski
78926d9c96 test/random_failures: remove gossip shadow round injection
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.

The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.

The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.

Fixes: SCYLLADB-1466

Closes scylladb/scylladb#29460
2026-04-15 16:30:55 +02:00
Dario Mirovic
336dab1eec test: lib: fix broken retry in start_docker_service
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.

Fixes SCYLLADB-1542
2026-04-15 15:25:52 +02:00
Asias He
4137a4229c test: Stabilize tablet incremental repair error test
Use async tablet repair task flow to avoid a race where client timeout
returns while server-side repair continues after injections are
disabled.

Start repair with await_completion=false, assert it does not complete
within timeout under injection, abort/wait the task, then verify
sstables_repaired_at is unchanged.

Fixes SCYLLADB-1184

Closes scylladb/scylladb#29452
2026-04-15 16:24:43 +03:00
Dimitrios Symonidis
ca003680a7 test/object_store: verify object storage client creation and live reconfiguration 2026-04-15 14:28:39 +02:00
Dimitrios Symonidis
24a7b146fa sstables/utils/s3: split config update into sync and async parts
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:

Sync (runs in the observer, never suspends):
  - S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
    refresh flag.
  - GCS: install a freshly constructed client; stash the old one for
    async cleanup.
  - storage_manager: update _object_storage_endpoints and fire the
    async cleanup via a gate-guarded background fiber.

Async (gate-guarded background fiber):
  - S3: acquire _creds_sem, invalidate and rearm credentials only if
    the refresh flag is set.
  - GCS: drain and close stashed old clients.
2026-04-15 14:28:31 +02:00
Dimitrios Symonidis
a958da0ab9 test_config: improve logging for wait_for_config API 2026-04-15 14:28:31 +02:00
Dimitrios Symonidis
71714fdc0e db: introduce read-write lock to synchronize config updates with REST API
Config is reloaded from SIGHUP on shard 0 and broadcast to all shards
under a write lock. REST API callers reading find_config_id acquire a
read lock via value_as_json_string_for_name() and are guaranteed a
consistent snapshot even when a reload is in progress.
2026-04-15 14:28:31 +02:00
dependabot[bot]
d584e8e321 build(deps): bump sphinx-scylladb-theme from 1.9.1 to 1.9.2 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.9.1 to 1.9.2.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.9.1...1.9.2)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.9.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#29476
2026-04-15 14:57:37 +03:00
Gleb Natapov
ca24dd4a5f topology coordinator: log request cancellation only when request are really canceled
Currently cancellation is logged in get_next_task, but the function is
called by tablets code as well where we do not act upon its result, only
yield to the topology coordinator. But the topology coordinator will not
necessary do the cancellation as well since it can be busy with tablets
migration. As a result cancellation is logged, but not done which is
confusing. Fix it by logging cancellation when it is actually happens.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409

Closes scylladb/scylladb#29471
2026-04-15 14:46:59 +03:00
Botond Dénes
280fe7cfb7 Merge 'Make inclusion of config.hh cheaper' from Nadav Har'El
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.

The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself.

1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice faster :-)

Refs #1.

Closes scylladb/scylladb#28992

* github.com:scylladb/scylladb:
  config: move named_value<T> method bodies out-of-line
  config: suppress named_value<T> instantiation in every source file
2026-04-15 14:40:15 +03:00
Botond Dénes
00d8470554 Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)

Closes scylladb/scylladb#29463

* github.com:scylladb/scylladb:
  test: filter benign errors in tests that grep logs during shutdown
  test: filter_errors: support list[list[str]] error groups
2026-04-15 14:40:15 +03:00
Piotr Szymaniak
5b00675bf0 storage_proxy: expedite speculative retry on replica disconnect
When a replica disconnects during a digest read (e.g., during
decommission), the speculating_read_executor now immediately fires
the pending speculative retry instead of waiting for the timer.

On DISCONNECT, the digest_read_resolver invokes an _on_disconnect
callback set by the executor. The callback cancels the speculate
timer and rearms it to clock_type::now() (lowres_clock::now() =
thread-local memory read, no syscall). The existing timer callback
fires on the next reactor poll with all its logic intact — checking
is_completed(), calling add_wait_targets(1), sending the request,
and incrementing speculative_digest_reads/speculative_data_reads.

The notification is fire-and-forget: on_error() does NOT absorb the
DISCONNECT. The existing error arithmetic in digest_read_resolver
already handles this correctly because _target_count_for_cl accounts
for the speculative target.

For never_speculating_read_executor (no spare target) and
always_speculating_read_executor (all requests sent upfront),
_on_disconnect is never set — no behavior change.

Fixes scylladb/scylladb#26307

Closes scylladb/scylladb#29428
2026-04-15 14:40:15 +03:00
Raphael S. Carvalho
a2eed4bb45 service: Use optimistic replicas in all_sibling_tablet_replicas_colocated
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.

Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.

Steps that lead to the problem:

1. Merge is triggered. The load balancer generates co-location migrations
   for all sibling pairs that are not yet on the same shard. Some pairs
   finish co-location before others.

2. Once all pairs are co-located in committed state,
   all_sibling_tablet_replicas_colocated returns true and finalize_resize
   is set. Meanwhile the load balancer may have already started a regular
   LB migration on one co-located pair (both tablets are stable and the
   load balancer is free to move them).

3. The LB migration moves both tablets together (colocated_tablets). Their
   two del_transition commits land in separate Raft rounds. After the first
   commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.

4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
   NOT co-located, clears finalize_resize, and the load balancer generates
   new migrations for other tablets to rebalance the load that the pair
   move created.

5. Those new migrations can take tens of seconds to stream, keeping the
   coordinator in handle_tablet_migration mode and preventing
   maybe_start_tablet_resize_finalization from being called. The merge
   finalization is delayed until all those cascaded migrations complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29465
2026-04-15 14:40:15 +03:00
Marcin Maliszkiewicz
53b6e9fda5 Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov
Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included.

Cleaning components inter-dependencies, not backporting

Closes scylladb/scylladb#29429

* github.com:scylladb/scylladb:
  test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
  describe_statement: Get cluster info from storage_service
  storage_service: Add describe_cluster() method
  query_processor: Expose storage_service accessor
2026-04-15 14:40:15 +03:00
Botond Dénes
d0e99e018b reader_concurrency_semaphore: drop unused stop_ext_{pre,post}()
Left over from primordial times, when reader_concurrency_semaphore was
baseclass for extensions in the separate enterprise repository.
Also remove the now unneded virtual marker from the destructor.

Closes scylladb/scylladb#29399
2026-04-15 14:40:15 +03:00
Botond Dénes
4a2d032c6f Merge 'query: result_set: change row member to a chunked vector' from Benny Halevy
To prevent large memory allocations.

This series shows over 3% improvement in perf-simple-query throughput.
```
$ build/release/scylla perf-simple-query --default-log-level=error --smp=1 --random-seed=1855519715
random-seed=1855519715
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

Before:
random-seed=1775976514
enable-cache=1
enable-index-cache=1
sstable-summary-ratio=0.0005
sstable-format=me
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
336345.11 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32788 insns/op,   12430 cycles/op,        0 errors)
348748.14 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32794 insns/op,   12335 cycles/op,        0 errors)
349012.63 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32800 insns/op,   12326 cycles/op,        0 errors)
350629.97 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32770 insns/op,   12270 cycles/op,        0 errors)
348585.00 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32804 insns/op,   12338 cycles/op,        0 errors)
throughput:
        mean=   346664.17 standard-deviation=5825.77
        median= 348748.14 median-absolute-deviation=2348.46
        maximum=350629.97 minimum=336345.11
instructions_per_op:
        mean=   32791.35 standard-deviation=13.60
        median= 32794.47 median-absolute-deviation=8.65
        maximum=32804.45 minimum=32769.57
cpu_cycles_per_op:
        mean=   12340.05 standard-deviation=57.57
        median= 12335.05 median-absolute-deviation=13.94
        maximum=12430.42 minimum=12270.28

After:
random-seed=1775976514
enable-cache=1
enable-index-cache=1
sstable-summary-ratio=0.0005
sstable-format=me
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
353770.85 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32762 insns/op,   11893 cycles/op,        0 errors)
364447.98 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32738 insns/op,   11818 cycles/op,        0 errors)
365268.97 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32734 insns/op,   11788 cycles/op,        0 errors)
344304.87 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32746 insns/op,   12506 cycles/op,        0 errors)
362263.57 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32756 insns/op,   11888 cycles/op,        0 errors)
throughput:
        mean=   358011.25 standard-deviation=8916.76
        median= 362263.57 median-absolute-deviation=6436.74
        maximum=365268.97 minimum=344304.87
instructions_per_op:
        mean=   32747.06 standard-deviation=11.85
        median= 32745.80 median-absolute-deviation=9.36
        maximum=32762.18 minimum=32734.01
cpu_cycles_per_op:
        mean=   11978.65 standard-deviation=298.06
        median= 11887.96 median-absolute-deviation=160.96
        maximum=12505.72 minimum=11788.49
```

Refs #28511
(Refs rather than Fixes for the lack of a reproducer unit test)

* No backport needed as the issue is rare and not severe

Closes scylladb/scylladb#28631

* github.com:scylladb/scylladb:
  query: result_set: change row member to a chunked vector
  query: result_set_row: make noexcept
  query: non_null_data_value: assert is_nothrow_move_constructible and assignable
  types: data_value: assert is_nothrow_move_constructible and assignable
2026-04-15 14:40:15 +03:00
Nadav Har'El
1eb8d170dd Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.

The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.

To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until it the new one is up and ready.

This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.

Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.

Fixes: VECTOR-610

Closes scylladb/scylladb#29407

* github.com:scylladb/scylladb:
  docs: document vector index metadata and duplicate handling
  test/cqlpy: cover vector index duplicate creation rules
  vector_index: allow multiple named indexes on one column
  vector_index: store `index_version` as creation timeuuid
2026-04-15 14:40:15 +03:00
Botond Dénes
a9c86fc2e4 docs: document schema subcomponent in sstable-scylla-format.md
Commit 234f905 (sstables: scylla_metadata: add schema member) added a
new Schema subcomponent (tag 11) to scylla_metadata. Document it in the
sstable Scylla format reference:

- Add schema to the subcomponent grammar enumeration
- Add a summary entry describing the subcomponent (tag 11) and its purpose
- Add a detailed ## schema subcomponent section with the binary grammar,
  covering table_id, table_schema_version, keyspace_name, table_name and
  the column_description array (column_kind, column_name, column_type)

Fixes https://github.com/scylladb/scylladb/issues/27960

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28983
2026-04-15 14:40:15 +03:00
Botond Dénes
5891efc2ca Merge 'service: add missing replicas if tablet rebuild was rolled back' from Aleksandra Martyniuk
RF change of tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because pending replica was excluded), rf change request finishes successfully. In this case we end up with the state of the replicas that isn't compatible with the expected keyspace replication.

Modify topology coordinator so that if it were to be idle, it starts checking if there are any missing replicas. It moves to transition_state::tablet_migration and run required rebuilds.

If a new RF change request encounters invalid state of replicas it fails. The state will be fixed later and the analogical ALTER KEYSPACE statement will be allowed.

Fixes: SCYLLADB-109.

Requires backport to all versions with tablet keyspace rf change.

Closes scylladb/scylladb#28709

* github.com:scylladb/scylladb:
  test: add test_failed_tablet_rebuild_is_retried_on_alter
  test: add a test to ensure that failed rebuilds are retried
  service: fail ALTER KEYSPACE if replicas do not satisfy the replication
  service: retry failed tablet rebuilds
  service: maybe_start_tablet_migration returns std::optional<group0_guard>
2026-04-15 14:40:15 +03:00
David Garcia
0eaa42c846 docs: Makefile: drop redundant -t $(FLAG) from sphinx options
Related  scylladb/scylladb-docs-homepage#153.

make multiversion failed under Sphinx 8+ with:

```
sphinx-build: error: argument --tag/-t: expected one argument
subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2.
make: *** [multiversion] Error 1
```

sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional.

Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter
argparse CLI rejects it. Instead, it now reads FLAGS from an env variable.

How to test:

````
make multiversion
make FLAG=opensource multiversion
````

Both complete and switch variants correctly.

chore: rm empty lines

Closes scylladb/scylladb#29472
2026-04-15 14:40:15 +03:00
dependabot[bot]
280ffe107f build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.7 to 0.3.8.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#29466
2026-04-15 14:40:15 +03:00
Raphael S. Carvalho
1529605b32 logstor: Fix dangling reference captures and shadowed loc variable
Three bugs fixed in segment_manager.cc:

1. write_to_separator(): captured [&index] where index was a local
   coroutine-frame reference. The future is stored in
   buf.pending_updates and resolved later in flush_separator_buffer(),
   by which time the enclosing coroutine frame is destroyed, making
   &index a dangling pointer. This is a use-after-free that manifests
   as a segfault. Fix: capture index_ptr (raw pointer by value) instead.

2. add_segment_to_compaction_group(): same dangling [&index] pattern
   inside the for_each_live_record lambda during recovery. Same fix
   applied.

3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
   'log_location loc', causing the function to always return a
   zero-initialized log_location{}. Fix: remove 'auto' so the
   assignment targets the outer variable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29451
2026-04-15 14:40:15 +03:00
Tomasz Grabiec
266a225416 utils: avoid exceptions in disk_space_monitor polling loop
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.

Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.

Fixes SCYLLADB-1477

Closes scylladb/scylladb#29438
2026-04-15 14:40:15 +03:00
Pavel Emelyanov
a428472e50 db: Remove redundant enable_logstor config option
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29427
2026-04-15 14:40:15 +03:00
Botond Dénes
87eb20ba33 Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.

Since f344bd0aaa, aggregate queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.

No backport: enhancement

Closes scylladb/scylladb#29422

* github.com:scylladb/scylladb:
  cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
  test: cluster: dtest: Fix double-counting of metrics
2026-04-15 14:40:15 +03:00
Botond Dénes
aecb6b1d76 Merge 'auth: sanitize {USER} substitution in LDAP URL template' from Piotr Smaron
`LDAPRoleManager` interpolated usernames directly into `ldap_url_template`,
allowing LDAP filter injection and URL structure manipulation via crafted
usernames.

This PR adds two layers of encoding when substituting `{USER}`:
1. **RFC 4515 filter escaping** — neutralises `*`, `(`, `)`, `\`, NUL
2. **URL percent-encoding** — prevents `%`, `?`, `#` from breaking
   `ldap_url_parse`'s component splitting or undoing the filter escaping
It also adds `validate_query_template()` at startup to reject templates
that place `{USER}` outside the filter component (e.g. in the host or
base DN), where filter escaping would be the wrong defense.
Fixes: SCYLLADB-1309

Compatibility note:
Templates with `{USER}` in the host, base DN, attributes, or extensions
were previously silently accepted. They are now rejected at startup with
a descriptive error. Only templates with `{USER}` in the filter component
(after the third `?`) are valid.

Fixes: SCYLLADB-1309

Due to severeness, should be backported to all maintained versions.

Closes scylladb/scylladb#29388

* github.com:scylladb/scylladb:
  auth: sanitize {USER} substitution in LDAP URL templates
  test/ldap: add LDAP filter-injection reproducers
2026-04-15 14:40:15 +03:00
Artsiom Mishuta
146a67cf6f test: explicitly wait for schema agreement in create_new_test_keyspace
Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE
in create_new_test_keyspace to ensure all nodes have applied the schema
before proceeding.

Closes scylladb/scylladb#29371
2026-04-15 14:40:15 +03:00
Pavel Emelyanov
54e3c648a5 test/cluster/dtest: improve diagnostics in test_update_schema_while_node_is_killed
The alter_table case has a known failure where point lookups at QUORUM
return 0 rows after node2 restarts, even though:
- the schema was correctly synced (ALTER TABLE received from cluster)
- the data commitlog was replayed (21 mutations, 0 skipped)
- all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by
  node1+node3 regardless of node2's state

The LIMIT 1 table scan succeeds (data is present somewhere), but
specific key lookups return empty. This points to a bug in how node2,
acting as coordinator after restart, routes single-partition reads —
most likely stale tablet routing metadata.

Add diagnostics to help distinguish data loss from a coordinator/routing
bug on the next failure:
- log which key is missing
- dump all rows visible at QUORUM
- query each node individually at ONE consistency for the missing key

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29350
2026-04-15 14:40:15 +03:00
Piotr Szymaniak
4c93c2af62 audit/alternator: support audit_tables=alternator.<table> shorthand
The real keyspace name of an Alternator table T is "alternator_T".
Expand the "alternator.T" format used in the audit_tables config flag
to the real keyspace name at parse time, so users don't need to spell
out the internal "alternator_T.T" form.
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
0714d8aded audit/alternator: Add negative audit tests
Add tests for the unhappy path of Alternator audit logging:
- Category filtering: operations are not logged when their category
  (DML, QUERY, DDL) is excluded from audit_categories.
- Keyspace filtering: operations on a keyspace not listed in
  audit_keyspaces are not logged.
- Error entries: a failed operation (thrown exception after audit_info
  is set) produces an audit entry with error=true.
- Empty-keyspace bypass: global operations like ListTables and
  DescribeEndpoints are logged regardless of audit_keyspaces because
  should_log() short-circuits on an empty keyspace.
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
ad05b44931 audit/alternator: Add testing of auditing
There is a new test file created, `test/alternator/test_audit.py`.
The file contains a suite of tests of all auditing operations.
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
6913efab5c audit/alternator: Audit requests
Both the successful ones as well as the failed ones are audited.

Each Alternator operation sets up audit metadata via an
executor::maybe_audit() helper, which checks will_log() and only
heap-allocates audit_info_alternator when auditing is enabled. DDL
and metadata operations pass no consistency level; data read/write
operations pass the actual CL used.

BatchWriteItem and BatchGetItem guard table name collection with
will_log() to avoid unnecessary work when auditing is disabled.
ListStreams audits the input table name rather than collecting
output table names during iteration. UntagResource sets up
auditing after parameter validation. Exception re-throw in
server.cc uses co_return coroutine::exception().

The chosen audit types for the operations:
- CreateTable - DDL
- DescribeTable - QUERY
- DeleteTable - DDL
- UpdateTable - DDL
- PutItem - DML
- UpdateItem - DML
- GetItem - QUERY
- DeleteItem - DML
- ListTables - QUERY
- Scan - QUERY
- DescribeEndpoints - QUERY
- BatchWriteItem - DML
- BatchGetItem - QUERY
- Query - QUERY
- TagResource - DDL
- UntagResource - DDL
- ListTagsOfResource - QUERY
- UpdateTimeToLive - DDL
- DescribeTimeToLive - QUERY
- ListStreams - QUERY
- DescribeStream - QUERY
- GetShardIterator - QUERY
- GetRecords - QUERY
- DescribeContinuousBackups - QUERY
2026-04-15 11:55:42 +02:00
Piotr Szymaniak
9646ee05bd audit/alternator: Refactor in preparation for auditing Alternator
Prepare API in audit for auditing Alternator.
The API provides an externally-callable functions `inspect()`,
for both CQL and Alternator.
Both variants of the function would unpack parameters and merge into
calling a common `maybe_log()`, which can then call `log()` when
conditions are met.
Also, while I was at it, (const) references were favoured over raw
pointers.

The Alternator audit_info subclass (audit_info_alternator) carries an
optional consistency level — only data read/write operations have a
meaningful CL, while DDL and metadata queries store an empty string
in the audit table and syslog (matching the existing write_login
behavior). The storage helpers are updated accordingly.

Add a will_log(category, keyspace, table) method that checks whether
an operation should be audited (category check AND keyspace/table
filtering) without requiring a constructed audit_info object.
should_log() delegates to will_log().
2026-04-15 11:46:44 +02:00
Tomasz Grabiec
84361194c2 test: boost: tablets: Add test for merge with arbitrary tablet count 2026-04-15 10:40:56 +02:00
Tomasz Grabiec
7af9f5366d tablets, database: Advertise 'arbitrary' layout in snapshot manifest
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.

Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.

We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
2026-04-15 10:40:56 +02:00
Tomasz Grabiec
50fbac6ea6 tablets: Introduce pow2_count per-table tablet option
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
2026-04-15 10:40:56 +02:00
Tomasz Grabiec
b6a7023f68 tablets: Prepare for non-power-of-two tablet count
This is a step towards more flexibility in managing tablets.  A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.

After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.

We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.

One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.

Example (3 tablets):
  [0] owns 1/3 of tokens
  [1] owns 1/3 of tokens
  [2] owns 1/3 of tokens

After merge:
  [0] owns 2/3 of tokens
  [1] owns 1/3 of tokens

What we would like instead:

Step 1 (split [1]):
  [0] owns 1/3 of tokens
  [1] old 1.left, owns 1/6 of tokens
  [2] old 1.right, owns 1/6 of tokens
  [3] owns 1/3 of tokens

Step 2 (merge):
  [0] owns 1/2 of tokens
  [1] owns 1/2 of tokens

To do that, we need to be able to split individual tablets, but we're
not there yet.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
f54daef4ec tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
This way it doesn't need to know how the scheduler chose to merge tablets.
We'll have less duplication of logic.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
66fc7967b8 tablets: Prepare resize_decision to hold data in decisions
merge decision will carry a plan - which replica to isolate.
So construction from a string will no longer do.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
d543f260bd tablets: table: Make storage_group handle arbitrary merge boundaries
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.

In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
2026-04-15 10:40:55 +02:00
Nadav Har'El
022add117e test/cluster: fix flaky test test_row_ttl_scheduling_group
The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to
verify that the new CQL per-row TTL feature does all its work (expiration
scanning, deletion of expired items) on all nodes in the "streaming"
scheduling group, not in the statement scheduling group.

As originally written, the test couldn't require that it uses exactly zero
time in the statement scheduling group - because some things do happen
there - specifically the ALTER TABLE request we use to enable TTL.
So the test checked that the time in the "wrong" group is less than 0.2
of the total time, not zero.

But in one CI run, we got to exactly 0.2 and the test failed. Running
this test locally, I see the margin is pretty narrow: The test almost
always fails if I set the threshold ratio to 0.1.

The solution in this patch is to move the ALTER TABLE work to a different
scheduling group (by using an additional service level). After doing that
the CPU usage in sl:default goes down to exactly zero - not close to zero
but exactly zero.

However, it seems that there is always some rare background work in
sl:default and debug builds it can come out more than 0ms (e.g., in
one test we saw 1ms), so we keep checking that sl:default is much
lower than sl:stream - not exactly zero.

Incidentally, I converted the serial loop adding the 200 rows in the
test's setup to a parallel loop, to make the test setup slightly faster.

I also added to the test a sanity check that the scheduling group sl:default
that we are measuring that TTL does zero work in, is actually the scheduling
group that normal writes work in (to avoid the risk of having a test that
verifies that some irrelevant scheduling group is unsurprisingly getting
zero usage...).

Fixes SCYLLADB-1495.

Closes scylladb/scylladb#29447
2026-04-15 08:42:29 +03:00
Jenkins Promoter
3d0582d51e Update pgo profiles - aarch64 2026-04-15 05:26:22 +03:00
Jenkins Promoter
a4d3ab9f0e Update pgo profiles - x86_64 2026-04-15 04:26:28 +03:00
Tomasz Grabiec
6d510bcd1c tablets: Make stats update post-merge work with arbitrary merge boundaries
We only assume that new tablets share boundaries with some old tablets.

In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
2026-04-15 01:25:16 +02:00
Tomasz Grabiec
01fb97ee78 locator: tablets: Support arbitrary tablet boundaries
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.

Implementation details:

We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).

Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):

For 131'072 tablets:

Before:

  Size of tablet_metadata in memory: 57456 KiB

After:

  Size of tablet_metadata in memory: 59504 KiB
2026-04-15 01:25:14 +02:00
Tomasz Grabiec
82acdae74b locator: tablets: Introduce tablet_map::get_split_token()
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.

This will make later refactoring easier.
2026-04-15 01:24:48 +02:00
Tomasz Grabiec
2e1d41c206 dht: Introduce get_uniform_tokens() 2026-04-15 01:24:48 +02:00
Tomasz Grabiec
a58243bc1e Merge 'hint_sender: send hints to all tablet replicas if the tablet leaving due to RF--' from Ferenc Szili
Currently, hints that are sent to tablet replicas which are leaving due to RF-- can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. This method is called by the `hint_sender` to check if the hint should be sent only to the destination host, or to all the replicas. This way, we increase consistency. For v-node based ERPs, `is_leaving()` calls `token_metadata::is_leaving(host)`.

Fixes: SCYLLADB-287

This is an improvement, and backport is not needed.

Closes scylladb/scylladb#28770

* github.com:scylladb/scylladb:
  test: verify hints are delivered during tablet RF reduction
  hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
  erm: add is_leaving() to effective_replication_map
2026-04-14 22:51:34 +02:00
Tomasz Grabiec
7fe4ae16f0 Merge 'table: don't create new split compaction groups if main compaction group is disabled' from Ferenc Szili
Fixes a race condition where tablet split can crash the server during truncation.

`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.

In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.

A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:

- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.

The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.

Fixes: SCYLLADB-1035

This needs to be backported to all currently supported version.

Closes scylladb/scylladb#29250

* github.com:scylladb/scylladb:
  test: add test_split_emitted_during_truncate
  table: fix race between tablet split and truncate
2026-04-14 22:00:40 +02:00
Avi Kivity
21d9f54a9a partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.

This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.

However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.

The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.

Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.

The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.

The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
  - Mutation cleaner merging versions in the background (changes change_mark)
  - LSA segment compaction during reserve() (invalidates references)
  - B-tree rebalancing on partition insertion (invalidates references)
  - Debug mode's always-true need_preempt() creating many multi-version
    partitions via preempted apply_monotonically()

A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.

Fixes: SCYLLADB-1253

Closes scylladb/scylladb#29368
2026-04-14 21:50:25 +02:00
Nadav Har'El
986167a416 Merge 'cql3: fix authorization bypass via BATCH prepared cache poisoning' from Marcin Maliszkiewicz
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.

Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.

Introduced in 98f5e49ea8

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221

Backport: all supported versions

Closes scylladb/scylladb#29432

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for BATCH prepared auth cache bypass
  cql3: fix authorization bypass via BATCH prepared cache poisoning
2026-04-14 22:31:54 +03:00
Pavel Emelyanov
cec44dc68d test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
Add parametrized integration test that verifies DESCRIBE CLUSTER returns correct
values in both normal and maintenance modes:

The parametrization keeps the validation logic (CQL queries and assertions)
identical for both modes, while the setup phase is mode-specific. This ensures
the same assertions apply to both cluster states:
- partitioner is org.apache.cassandra.dht.Murmur3Partitioner
- snitch is org.apache.cassandra.locator.SimpleSnitch
- cluster name matches system.local cluster_name

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:33:21 +03:00
Pavel Emelyanov
debfb147f5 describe_statement: Get cluster info from storage_service
Update cluster_describe_statement::describe() to retrieve cluster metadata
from storage_service::describe_cluster() instead of directly from db::config
or gossiper.

The storage_service provides a centralized API for accessing cluster metadata
(cluster_name, partitioner, snitch_name) that works in both normal and
maintenance modes, improving separation of concerns.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:33:06 +03:00
Pavel Emelyanov
53361358ef storage_service: Add describe_cluster() method
Add cluster_info struct containing cluster_name, partitioner, and snitch_name.
Implement describe_cluster() method to provide cluster metadata by combining
data from gossiper (cluster_name, partitioner) and snitch (snitch_name).

It will be used by next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:29:24 +03:00
Pavel Emelyanov
0d4a8a04ec query_processor: Expose storage_service accessor
Add storage_service() method to expose the sharded storage service to callers.
To be used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:29:11 +03:00
Radosław Cybulski
4b984212ba alternator: improve parsing / generating of StreamArn parameter
Previously Alternator, when emit Amazon's ARN would not stick to the
standard. After our attempt to run KCL with scylla we discovered few
issues.

Amazon's ARN looks like this:

arn:partition:service:region:account-id:resource-type/resource-id

for example:

arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291

KCL checks for:
- ARN provided from Alternator calls must fit with basic Amazon's ARN
  pattern shown above,
- region constisting only of lower letter alphabets and `-`, no
  underscore character
- account-id being only digits (exactly 12)
- service being `dynamodb`
- partition starting with `aws`

The patch updates our code handling ARNs to match those findings.

1. Split `stream_arn` object into `stream_arn` - ARN for streams only and
`stream_shard_id` - id value for stream shards. The latter receives original
implementation. The former emits and parses ARN in a Amazon style.
 for example:
2. Update new `stream_arn` class to encode keyspace and table together
separating them by `@`. New ARN looks like this:

arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291

3. hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as
   region and `000000000000` as account-id (must have 12 digits)
4. Update code handling ARNs for tags manipulation to be able to parse
   Amazon's style ARNs. Emiting code is left intact - the parser is now
   capable of parsing both styles.
5. Added unit tests.

Fixes #28350
Fixes: SCYLLADB-539
Fixes: #28142

Closes scylladb/scylladb#28187
2026-04-14 18:07:05 +03:00
Marcin Maliszkiewicz
de19714763 Merge 'cql3: prepare list statments metadta_id during prepare statement , send the correct metadata_id directly to the client ' from Alex Dathskovsky
This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior.

The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about.

The second patch adds regression coverage for both sides of the metadata-id flow:

- a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED
- a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata

Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1218

backport: this change should be backported to all active branches to help the driver operation

Closes scylladb/scylladb#29347
2026-04-14 16:09:49 +02:00
bitpathfinder
c1315f9f1e commitlog: add test to verify segment replay order
Add a boost test that verifies commitlog segments are replayed in
ascending segment ID order within each shard. The test creates
multiple segments, triggers replay via commitlog_replayer, and
captures the "Replaying" debug log messages to verify the order.

Correct segment ordering is required by the strongly consistent
tables feature, particularly commitlog-based storage that relies
on replayed raft items being stored in order.

Ref SCYLLADB-1411.
2026-04-14 16:06:13 +02:00
bitpathfinder
c06adffd6a commitlog: fix replay order by using ordered map per shard
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard. This can increase memory and disk
consumption during fragmented entry reconstruction, which accumulates
fragments across segments and benefits from ascending ID order.

This is also required by the strongly consistent tables feature,
particularly commitlog-based storage that relies on replayed raft
items being stored in order.

Fix by changing the data structure from
  std::unordered_multimap<unsigned, commitlog::descriptor>
to
  std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>

Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.

Fixes SCYLLADB-1411.
2026-04-14 16:05:17 +02:00
Anna Stuchlik
633297b15d doc: remove an oudated troubleshooting page
Fixes https://github.com/scylladb/scylladb/issues/29405

Closes scylladb/scylladb#29431
2026-04-14 15:14:32 +03:00
Ernest Zaslavsky
0eb6270c82 ci: add build system comparison workflow
Add a GitHub Actions workflow that runs scripts/compare_build_systems.py
on PRs touching build system files (configure.py, **/CMakeLists.txt,
cmake/**).
This prevents future deviations between the two build systems by
catching mismatches early in the CI pipeline.

Closes scylladb/scylladb#29426
2026-04-14 14:53:12 +03:00
Avi Kivity
4a9fdb17f0 build: cmake: fix -fno-sanitize-address-use-after-scope for CQL parser
The CMake build had -fsanitize-address-use-after-scope (enable) when
it should have been -fno-sanitize-address-use-after-scope (disable).

The comment on lines 24-25 of cql3/CMakeLists.txt explains the intent:
the use-after-scope sanitizer uses too much stack space on CqlParser
and overflows the stack. The Python-ninja path in configure.py:2801-2802
correctly had -fno-sanitize-address-use-after-scope.

Found by black-box comparison of compiler flags between the Python-ninja
and CMake build paths (ninja -nv output, debug mode, CqlParser.o):

  Python-ninja: -fno-sanitize-address-use-after-scope  (correct: disable)
  CMake:        -fsanitize-address-use-after-scope      (wrong: enable)

Closes scylladb/scylladb#29439
2026-04-14 14:48:52 +03:00
Avi Kivity
ebdfa10c8f test: fix flaky test_incremental_repair_race_window_promotes_unrepaired_data
The test waited for two "Finished tablet repair" log messages on the
coordinator, expecting one per tablet.  But there are two log sources
that emit messages matching this pattern:

  repair module (repair/repair.cc:2329):
    "Finished tablet repair for table=..."
  topology coordinator (topology_coordinator.cc:2083):
    "Finished tablet repair host=..."

When the coordinator is also a repair replica (always the case with
RF=3 and 3 nodes), both messages appear in the coordinator log for the
same tablet within 1ms of each other.  The test consumed both, thinking
both tablets were done, while the second tablet repair was still running.

From the CI failure logs:

  04:08:09.658 Found: repair[...]: Finished tablet repair for table=...
    global_tablet_id=e42fd650-3542-11f1-9756-85403784a622:0
  04:08:09.660 Found: raft_topology - Finished tablet repair host=...
    tablet=e42fd650-3542-11f1-9756-85403784a622:0

Both messages are for tablet :0.  Tablet :1 repair had not finished yet.

The test then wrote keys 20-29 while the second tablet repair was still
in progress.  That repair flushed the memtable (via
prepare_sstables_for_incremental_repair), including keys 20-29 in the
repair scan, and mark_sstable_as_repaired set repaired_at=2 on the
resulting sstable.  This caused the assertion failure on servers[0]:
  "should not have post-repair keys in repaired sstables, got:
   {20, 21, 22, 23, 24, 25, 26, 27, 28, 29}"

Fix by matching "Finished tablet repair host=" which is unique to the
topology coordinator message and avoids the ambiguity.

Also fix an incorrect comment that said being_repaired=null when at that
point in the test being_repaired is still set to the session_id (the
delay_end_repair_update injection prevents end_repair from running).

Fixes: SCYLLADB-1478

Closes scylladb/scylladb#29444
2026-04-14 13:32:51 +02:00
Piotr Dulikowski
9fc2c65d18 Merge 'cql3: implement WRITETIME() and TTL() of individual elements of map, set, and UDT' from Nadav Har'El
In commit 727f68e0f5 we added the ability to SELECT:

* Individual elements of a map: `SELECT map_col[key]`.
* Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, allowing to check if the element exists in the set.
* Individual pieces of a UDT: `SELECT udt_col.field`.

But at the time, we didn't provide any way to retrieve the **meta-data** for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`.

Users requested to support such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature - for both maps and sets and udts - so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns - stored as map elements, not individual columns - with their WRITETIME information.

The first four patches in this series adds the feature (in four smaller patches instead one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation.

All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it.

The fix was surprisingly difficult. Our existing implementation (from 727f68e0f5 building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does **not** include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation.

While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this.

Fixes #15427
Fixes #13624

Closes scylladb/scylladb#29134

* github.com:scylladb/scylladb:
  cql3: support UDT fields in LWT expressions
  cql3: document WRITETIME() and TTL() for elements of map, set or UDT
  test/boost: test WRITETIME() and TTL() on map collection elements
  test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
  cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
  cql3: parse per-element timestamps/TTLs in the selection layer
  cql3: add extended wire format for per-element timestamps and TTLs
  cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
2026-04-14 12:35:46 +02:00
Dawid Pawlik
f40ab83d02 docs: document vector index metadata and duplicate handling
Document the new vector index behavior in the user-facing and developer
docs.

Describe `index_version` as a creation timeuuid stored in
`system_schema.indexes`, clarify that recreating an index changes it
while ALTER TABLE does not, and document that Scylla allows multiple
named vector indexes on the same column while still rejecting unnamed
duplicates.
2026-04-14 12:21:38 +02:00
Dawid Pawlik
800dec2180 test/cqlpy: cover vector index duplicate creation rules
Add cqlpy tests for the current CREATE INDEX behavior of vector indexes.

Cover named and unnamed duplicates, IF NOT EXISTS, coexistence of
multiple named vector indexes on the same column, interactions between
named and unnamed indexes, and the same-name-on-different-table case.
2026-04-14 12:21:38 +02:00
Marcin Maliszkiewicz
db5e4f2cb8 test/cqlpy: add reproducer for BATCH prepared auth cache bypass
An unprivileged user could bypass authorization checks by exploiting
the BATCH prepared statement cache:

1. Prepare an INSERT on a table the user has no access to
2. Execute it inside a BATCH — gets Unauthorized
3. Execute the same prepared INSERT directly — succeeds
2026-04-14 10:37:42 +02:00
Marcin Maliszkiewicz
8401e9cbbd test: filter benign errors in tests that grep logs during shutdown
Apply filter_errors() to grep_for_errors() results in
test_split_stopped_on_shutdown and
test_group0_apply_while_node_is_being_shutdown. Without filtering,
benign RPC errors like 'connection dropped: Semaphore broken' that
occur during graceful shutdown cause spurious test failures.
2026-04-13 18:33:41 +02:00
Marcin Maliszkiewicz
e78e6cd584 test: filter_errors: support list[list[str]] error groups
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
2026-04-13 18:33:29 +02:00
Alex
fdce8824a5 test/cluster: cover prepared LIST metadata ids in one setup
Precompute the expected metadata-id hashes for the prepared LIST auth and
service-level statements and verify that PREPARE returns them while EXECUTE
reuses the prepared metadata without METADATA_CHANGED. Run all cases in a
single auth-cluster test after preparing the cluster, role, and service level
once through the regular manager fixture.
2026-04-13 19:13:12 +03:00
Marcin Maliszkiewicz
4d3ca041bb cql3: fix authorization bypass via BATCH prepared cache poisoning
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.

Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.

Introduced in 98f5e49ea8

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
2026-04-13 17:57:22 +02:00
Alex
0f6d9ffd22 cql: expose stable result metadata for prepared LIST statements
Prepared LIST statements were not calculating metadata in PREPARE path, and sent empty string hash to client causing problematic behaviour where metadat_id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuse that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
2026-04-13 17:49:27 +03:00
Dawid Pawlik
63b782451e vector_index: allow multiple named indexes on one column
Allow creating multiple named vector indexes on the same column while
still rejecting duplicate unnamed ones.

`index_metadata::equals_noname()` now ignores `index_version`,
which is unique for every vector index creation, so duplicate detection
keeps working for unnamed vector indexes.

CREATE INDEX keeps using structural duplicate detection for regular
indexes and unnamed vector indexes, but named vector indexes are checked
by name only.

The explicit name check is also needed for IF NOT EXISTS when the same
index name already exists on a different table in the same keyspace,
because vector indexes have no backing view table to catch that case.
2026-04-13 15:04:59 +02:00
Ferenc Szili
e904e7a715 test: add test_split_emitted_during_truncate
Add a regression test that reproduces the race between tablet split and
truncation. The test:

1. Creates a single-tablet table and inserts data.
2. Triggers truncation and pauses it (via database_truncate_wait) after
   compaction is disabled but before discard_sstables() runs.
3. Triggers tablet split and pauses it (via tablet_split_monitor_wait)
   at the start of process_tablet_split_candidate().
4. Releases split so set_split_mode() creates new compaction groups.
5. Waits for the set_split_mode log confirming the groups exist.
6. Releases truncation so discard_sstables() encounters the new groups.
7. Verifies truncation completes and split finishes.

Adds a tablet_split_monitor_wait error injection point in
process_tablet_split_candidate() to allow pausing the split monitor
before it enters the split loop.
2026-04-13 11:05:03 +02:00
Ferenc Szili
13d9561398 table: fix race between tablet split and truncate
Tablet split can call set_split_mode() between the point where
truncate_table_on_all_shards() disables compaction on all existing
compaction groups and the point where discard_sstables() checks that
compaction is disabled. The new split-ready compaction groups created
by set_split_mode() won't have compaction disabled, causing
discard_sstables() to fire on_internal_error.

Fix by preventing set_split_mode() from creating new compaction groups
when compaction is disabled on the main group. If truncation has
already disabled compaction, split will simply report not-ready rather
than creating groups which have compaction enabled.

This is safe because split will be retried once truncation completes
and re-enables compaction.
2026-04-13 11:04:38 +02:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Avi Kivity
22949bae52 Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak
implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.

* tablet merge simply merges the histograms of segments of one compaction group with another.
* for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups.
* for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range.

Refs SCYLLADB-770

no backport - still a new and experimental feature

Closes scylladb/scylladb#29207

* github.com:scylladb/scylladb:
  test: logstor: additional logstor tests
  docs/dev: add logstor on-disk format section
  logstor: add version and crc to buffer header
  test: logstor: tablet split/merge and migration
  logstor: enable tablet balancing
  logstor: streaming of logstor segments using stream_blob
  logstor: add take_logstor_snapshot
  logstor: segment input/output stream
  logstor: implement compaction_group::cleanup
  logstor: tablet split
  logstor: tablet merge
  logstor: add compaction reenabler
  logstor: add segment header
  logstor: serialize writes to active segment
  replica: extend compaction_group functions for logstor
  replica: add compaction_group_for_logstor_segment
  logstor: code cleanup
2026-04-12 16:11:12 +03:00
Israel Fruchter
79c736455e cqlsh: update to v6.0.34-scylla
Update cqlsh to version v6.0.34-scylla.

Notable fix:
- Fix vector type formatting error (scylladb/scylla-cqlsh#165)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Closes scylladb/scylladb#29401
2026-04-12 14:54:50 +03:00
Nadav Har'El
33dbb63aef cql3: support UDT fields in LWT expressions
In an earlier patch, we used the CQL grammar's "subscriptExpr" in
the rule for WRITETIME() and TTL(). But since we also wanted these
to support UDT fields (x.a), not just collection subscripts (x[3]),
we expanded subscriptExpr to also support the field syntax.

But LWT expressions already used this subscriptExpr, which meant
that LWT expressions unintentionally gained support for UDT fields.
Missing support for UDT fields in LWT is a long-standing known
Cassandra-compatibility bug (#13624), and now our grammar finally
supports the missing syntax.

But supporting the syntax is not enough for correct implementation
of this feature - we also need to fix the expression handling:

Two bugs prevented expressions like `v.a = 0` from working in LWT IF
clauses, where `v` is a column of user-defined type.

The first bug was in get_lhs_receiver() in prepare_expr.cc: it lacked
a handler for field_selection nodes, causing an "unexpected expression"
internal error when preparing a condition like `IF v.a = 0`. The fix
adds a handler that returns a column_specification whose type is taken
from the prepared field_selection's type field.

The second bug was in search_and_replace() in expression.cc: when
recursing into a field_selection node it reconstructed it with only
`structure` and `field`, silently dropping the `field_idx` and `type`
fields that are set during preparation. As a result, any transformation
that uses search_and_replace() on a prepared expression containing a
field_selection — such as adjust_for_collection_as_maps() called from
column_condition_prepare() — would zero out those fields. At evaluation
time, type_of() on the field_selection returned a null data_type
pointer, causing a segmentation fault when the comparison operator tried
to call ->equal() through it. The fix preserves field_idx and type when
reconstructing the node.

Fixes #13624.
2026-04-12 14:28:01 +03:00
Nadav Har'El
bb2fb810bb cql3: document WRITETIME() and TTL() for elements of map, set or UDT
Add to the SELECT documentation (docs/cql/dml/select.rst) documentation
of the new ability to select WRITETIME() and TTL() of a single element
of map, set or UDT.

Also in the TTL documentation (docs/cql/time-to-live.rst), which already
had a section on "TTL for a collection", add a mention of the ability
to read a single element's TTL(), and an example.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
a544dae047 test/boost: test WRITETIME() and TTL() on map collection elements
Add tests in test/boost/expr_test.cc for the low-level implementation
of writetime() and ttl() on a map element.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
ccb94618cc test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
This patch adds many tests verifying the behavior of WRITETIME() and
TTL() on individual elements of maps, sets and UDTs, serving as a
regression test for issue #15427. We also add tests verifying our
understanding of related issues like WRITETIME() and TTL() of entire
collections and of individual elements of *frozen* collections.

All new tests pass on Cassandra 5.0, helping to verify that our
implementation is compatible with Cassandra. They also pass on
ScyllaDB after the previous patch (most didn't before that patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:27:40 +03:00
Nadav Har'El
35e807a36c cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
Complete the implementation of SELECT WRITETIME(col[key])/TTL(col[key])
and WRITETIME(col.field)/TTL(col.field), building on the grammar (commit 1),
wire format (commit 2), and selection-layer (commit 3) changes in the
preceding patches.

* prepare_column_mutation_attribute() (prepare_expr.cc) now handles the
  subscript and field_selection nodes that the grammar produces:
  - For subscripts, it validates that the inner column is a non-frozen
    map or set and checks the 'writetime_ttl_individual_element' feature
    flag so the feature is rejected during rolling upgrades.
  - For field selections, it validates that the inner column is a
    non-frozen UDT, with the same feature-flag check.

* do_evaluate(column_mutation_attribute) (expression.cc) handles the
  same two cases. For a field selection it serializes the field index as
  a key and looks it up in collection_element_metadata; for a subscript
  it evaluates the subscript key and looks it up in the same map.
  A missing key (element not found or expired) returns NULL, matching
  Cassandra behavior.

Together with the preceding three patches, this finally fixes #15427.

The next three patches will add tests and documentation for the new
feature, and the final eighth patch will fix the implementation of
UDT fields in LWT expressions - which the first patch made the grammar
allow but is still not implemented correctly.
2026-04-12 13:28:28 +03:00
Nadav Har'El
4ac63de063 cql3: parse per-element timestamps/TTLs in the selection layer
Wire up the selection and result-set infrastructure to consume the
extended collection wire format introduced in the previous patch and
expose per-element timestamps and TTLs to the expression evaluator.

* Add collection_cell_metadata: maps from raw element-key bytes to
  timestamp and remaining TTL, one entry per collection or UDT cell.
  Add a corresponding collection_element_metadata span to
  evaluation_inputs so that evaluators can access it.

* Add a flag _collect_collection_timestamps to selection (selection.hh/cc).
  When any selected expression contains a WRITETIME(col[key])/TTL(col[key])
  or WRITETIME(col.field)/TTL(col.field) attribute, the flag is set and
  the send_collection_timestamps partition-slice option is enabled,
  causing storage nodes to use the extended wire format from the
  previous patch.

* Implement result_set_builder::add_collection() (selection.cc): when
  _collect_collection_timestamps is set, parse the extended format,
  decode per-element timestamps and remaining TTLs (computed from the
  stored expiry time and the query time), and store them in
  _collection_element_metadata indexed by column position.  When the
  flag is not set, the existing plain-bytes path is unchanged.

After this patch, the new selection feature is still not available to
the end-user because the prepare step still forbids it. The next patch
will finally complete the expression preparation and evaluation.
It will read the new collection_element_metadata and return the correct
timestamp or TTL value.
2026-04-12 12:51:06 +03:00
Nadav Har'El
bb63db34e5 cql3: add extended wire format for per-element timestamps and TTLs
Introduce the infrastructure needed to transport per-element timestamps
and TTL expiry times from replicas to coordinators, required for
WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) /
TTL(col.field).

* Add a 'writetime_ttl_individual_element' cluster feature flag that
  guards usage of the new wire format during rolling upgrades: the
  extended format is only emitted and consumed when every node in the
  cluster supports it.

* Implement serialize_for_cql_with_timestamps() (types/types.cc), a
  variant of serialize_for_cql() that appends a per-element section to
  the regular CQL bytes, listing each live element's serialized key,
  timestamp, and expiry.  The format is:
    [uint32 cql_len][cql bytes]
    [int32  entry_count]
    [per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)]
  expiry is -1 when the element has no TTL.

* Add partition_slice::option::send_collection_timestamps and modify
  write_cell() (mutation_partition.cc) to use the new function
  serialize_for_cql_with_timestamps() when this option is available.

This commit stands alone with no user-visible effect: nothing yet sets
the new partition-slice option.  The next patch adds the selection-layer
code that sets the option and parses the extended response.
2026-04-12 11:49:06 +03:00
Nadav Har'El
38b675737d cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
Previously, WRITETIME() and TTL() only accepted a simple column name
(cident), so WRITETIME(m['key']) or WRITETIME(x.a) was a syntax error.
This patch begins to implements support for applying WRITETIME() and
TTL() to individual elements of a non-frozen map, set or UDT, as
requested in issue #15427.

On its own this commit only changes the parser (Cql.g). The prepare
step still rejects subscript and field-selection nodes with an
invalid_request_exception, so there is no user-visible behavior change
yet - just that a syntax error is replaced by a different error.

Upcoming patches add the extended wire format for per-element timestamps
(commit 2), the selection layer that consumes it (commit 3), and the
prepare/evaluate logic that ties everything together (commit 4), after
which WRITETIME() and TTL(col[key]) for collection or UDT elements
will finally be fully functional.

The parser change in this patch expands the subscriptExpr rule to
support the col.field syntax, not only col[key]. This change also
allows the UDT field syntax to be used in LWT conditions, which is
another long-standing missing feature (#13624); But to correctly
support this feature we'll need an additional patch to fix a couple
of remaining bugs - this will be the eighth commit in this series.
2026-04-12 11:10:23 +03:00
Benny Halevy
e4f0539acf query: result_set: change row member to a chunked vector
To prevent large memory allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:49 +03:00
Benny Halevy
b433a5bcf8 query: result_set_row: make noexcept
Remove const specifier from result_set_row._cells member to make
the class nothrow_move_constructible and nothrow_move_assignable

To be used later in query result_set and friends.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:39 +03:00
Benny Halevy
c0607110c4 query: non_null_data_value: assert is_nothrow_move_constructible and assignable
To be used later in query result_set{row,} and friends.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:34 +03:00
Benny Halevy
afa438d60d types: data_value: assert is_nothrow_move_constructible and assignable
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:13 +03:00
Avi Kivity
8ccee6803e Merge 'Remove upgrade view builder' from Gleb Natapov
Since we do no longer support upgrade from versions that do not support
v2 of "view building status" code (building status is managed by raft) we can remove v1 code and upgrade code and make sure we do not boot with old "builder status" version.

v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.

No backport needed since this is code removal.

Closes scylladb/scylladb#29105

* github.com:scylladb/scylladb:
  view: drop unused v1 builder code
  view: remove upgrade to raft code
2026-04-12 00:39:26 +03:00
Botond Dénes
9770a4c081 test/cluster/test_encryption.py: use single-partition reads in read_verify_workload()
Replace the range scan in read_verify_workload() with individual
single-partition queries, using the keys returned by
prepare_write_workload() instead of hard-coding them.

The range scan was previously observed to time out in debug mode after
a hard cluster restart. Single-partition reads are lighter on the
cluster and less likely to time out under load.

The new verification is also stricter: instead of merely checking that
the expected number of rows is returned, it verifies that each written
key is individually readable, catching any data-loss or key-identity
mismatch that the old count-only check would have missed.

This is the second attemp at stabilizing this test, after the recent
854c374ebf. That fix made sure that the
cluster has converged on topology and nodes see each other before running
the verify workload.

Fixes: SCYLLADB-1331

Closes scylladb/scylladb#29313
2026-04-12 00:38:20 +03:00
Avi Kivity
ca80ee8586 Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov
The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 withing the new supergroup)

* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
  * `tablet_allocator::balance_tablets()`
  * `sstables_manager::components_reclaim_reload_fiber()`
  * `tablet_storage_group_manager::merge_completion_fiber()`
  * metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
  * hints sender
  * all view building related components (update generator, builder, workers)
  * repair
  * stream_manager
  * messaging service (except for verb handlers that switch groups)
  * join_cluster() activity
  * REST API
  * ... something else I forgot

The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.

All new sched groups inherit `request_class::maintenance` (however, "backup" seem not to make any requests yet).

Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.

Fixes SCYLLADB-351

New feature, not backporting

Closes scylladb/scylladb#28542

* github.com:scylladb/scylladb:
  code: Add maintenance/maintenance group
  backup: Add maintenance/backup group
  compaction: Add maintenance/maintenance_compaction group
  main: Introduce maintenance supergroup
  main: Move all maintenance sched group into streaming one
  database: Use local variable for current_scheduling_group
  code: Live-update IO throughputs from main
2026-04-12 00:34:48 +03:00
Botond Dénes
3289928679 repair: fix quadratic complexity when loading repair history
shared_tombstone_gc_state::update_repair_time() uses copy-on-write
semantics: each call copies the entire per_table_history_maps and the
per-table repair_history_map.  repair_service::load_history() called
this once per history entry, making the load O(N²) in both time and
memory.

Introduce batch_update_repair_time() which performs a single
copy-on-write for any number of entries belonging to the same table.
Restructure load_history() to collect entries into batches of up to
1000 and flush each batch in one call, keeping peak memory bounded.
The batch size limit is intentional: the repair history table currently
has no bound on the number of entries and can grow large.  Note that
this does not cause a problem in the in-memory history map itself:
entries are coalesced internally and only the latest repair time is
kept per range.  The unbounded entry count only makes the batched
update during load expensive.

Fixes: SCYLLADB-104

Closes scylladb/scylladb#29326
2026-04-11 23:54:26 +03:00
Michał Hudobski
7d648961ed vector_search: forward non-primary key restrictions to Vector Store service
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.

Add get_nonprimary_key_restrictions() getter to statement_restrictions.

Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.

Fixes: SCYLLADB-970

Closes scylladb/scylladb#29019
2026-04-10 17:16:29 +02:00
Piotr Smaron
477353b15c auth: sanitize {USER} substitution in LDAP URL templates
LDAPRoleManager interpolated usernames directly into ldap_url_template.
That allowed LDAP filter metacharacters to change the query, and URL
metacharacters such as %, ?, and # to change how ldap_url_parse()
split the URL.

Apply two layers of encoding when substituting {USER}:
  1. RFC 4515 filter escaping -- neutralises filter operators.
  2. URL percent-encoding    -- prevents ldap_url_parse from
     misinterpreting %-sequences, ? delimiters, or # fragments.

Add validate_query_template() (called from start()) which uses a
sentinel round-trip through ldap_url_parse to reject templates
that place {USER} outside the filter component.  Templates that
previously placed {USER} in the host or base DN were silently
accepted; they are now rejected at startup with a descriptive
error.

Change parse_url() to take const sstring& instead of string_view
to enforce the null-termination requirement of ldap_url_parse()
at the type level.

Add regression coverage for %2a, ?, #, and invalid {USER}
placement in the base DN, host, attributes, and extensions.

Update LDAP authorization docs to document the escaping behavior
and the {USER} placement restriction.

Fixes: SCYLLADB-1309
2026-04-10 14:00:47 +02:00
Dawid Pawlik
2dd8eef38c vector_index: store index_version as creation timeuuid
Vector indexes currently store the base table schema version in
`index_version`. That value is name-based, not time-based,
so it does not represent when the index was created.

Store a timeuuid instead and change the relevant interfaces from
`table_schema_version` to `utils::UUID`. This is a prerequisite
for supporting multiple vector indexes on the same column where
the oldest index must be selected deterministically via routing
implemented in Vector Store.

Update the cqlpy tests to check the new semantics directly:
recreating the index changes `index_version`, while ALTER TABLE does not.
2026-04-10 13:05:21 +02:00
Piotr Dulikowski
3bd770d4d9 Merge 'counters: reuse counter IDs by rack' from Michael Litvak
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.

Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).

Fixes SCYLLADB-356

backport not needed - an enhancement

Closes scylladb/scylladb#28901

* github.com:scylladb/scylladb:
  docs/dev: add counters doc
  counters: reuse counter IDs by rack
2026-04-10 12:24:18 +02:00
Wojciech Mitros
163c6f71d6 transport: refactor result_message bounce interface
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.

- Add virtual as_bounce() returning const bounce* to the base class
  (nullptr by default, overridden in bounce to return this), replacing
  the virtual move_to_shard() which conflated bounce detection with
  shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
  unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
  that already checked as_bounce()
- Move forward declarations of message types before the virtual
  methods so as_bounce() can reference bounce

Fixes: SCYLLADB-1066

Closes scylladb/scylladb#29367
2026-04-10 12:17:43 +02:00
Piotr Dulikowski
32e3a01718 Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek
Motivation
----------

Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.

Description of solution
-----------------------

The situations we focus on are the following:

* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns

We handle each of them and provide validation tests.

Implementation strategy
-----------------------

1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.

Tests
-----

We provide tests that validate the correctness of the solution.

The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):

Before:
```
real    0m31.809s
user    1m3.048s
sys     0m21.812s
```

After:
```
real    0m34.523s
user    1m10.307s
sys     0m27.223s
```

The incremental differences in time can be found in the commit messages.

Fixes SCYLLADB-429

Backport: not needed. This is an enhancement to an experimental feature.

Closes scylladb/scylladb#28526

* github.com:scylladb/scylladb:
  service: strong_consistency: Abort state_machine::apply when aborting server
  service: strong_consistency: Abort ongoing operations when shutting down
  service: client_state: Extend with abort_source
  service: strong_consistency: Handle abort when removing Raft group
  service: strong_consistency: Abort Raft operations on timeout
  service: strong_consistency: Use timeout when mutating
  service: strong_consistency: Fix indentation
  service: strong_consistency: Enclose coordinator methods with try-catch
  service: strong_consistency: Crash at unexpected exception
  test: cluster: Extract default config & cmdline in test_strong_consistency.py
2026-04-10 11:11:21 +02:00
Pavel Emelyanov
0b336da89d Revert "cmake: add missing rolling_max_tracker_test and symmetric_key_test"
This reverts commit 8b4a91982b.

Two commits independently added rolling_max_tracker_test to test/boost/CMakeLists.txt:
8b4a919 cmake: add missing rolling_max_tracker_test and symmetric_key_test
f3a91df test/cmake: add missing tests to boost test suite

The second was merged two days after the first. They didn't conflict on
code-level and applied cleanly resulting in a duplicate add_scylla_test()
entries that breaks the CMake build:

    CMake Error: add_executable cannot create target
    "test_boost_rolling_max_tracker_test" because another target
    with the same name already exists.

Remove the duplicate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reported-by: Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>
2026-04-10 11:19:43 +03:00
Patryk Jędrzejczak
751bf31273 Merge 'More gossiper cleanups' from Gleb Natapov
The PR contains more code cleanups, mostly in gossiper. Dropping more gossiper state leaving only NORMAL and SHUTDOWN. All other states are checked against topology state. Those two are left because SHUTDOWN state is propagated through gossiper only and when the node is not in SHUTDOWN it should be in some other state.

No need to backport. Cleanups.

Closes scylladb/scylladb#29129

* https://github.com/scylladb/scylladb:
  storage_service: cleanup unused code
  storage_service: simplify get_peer_info_for_update
  gossiper: send shutdown notifications in parallel
  gms: remove unused code
  virtual_tables: no need to call gossiper if we already know that the node is in shutdown
  gossiper: print node state from raft topology in the logs
  gossiper: use is_shutdown instead of code it manually
  gossiper: mark endpoint_state(inet_address ip) constructor as explicit
  gossiper: remove unused code
  gossiper: drop last use of LEFT state and drop the state
  gossiper: drop unused STATUS_BOOTSTRAPPING state
  gossiper: rename is_dead_state to is_left since this is all that the function checks now.
  gossiper: use raft topology state instead of gossiper one when checking node's state
  storage_service: drop check_for_endpoint_collision function
  storage_service: drop is_first_node function
  gossiper: remove unused REMOVED_TOKEN state
  gossiper: remove unused advertise_token_removed function
2026-04-10 09:56:20 +02:00
Nadav Har'El
6674aa29ca Merge 'Add Cassandra SAI (StorageAttachedIndex) compatibility' from Szymon Wasik
Cassandra's native vector index type is StorageAttachedIndex (SAI). Libraries such as CassIO, LangChain, and LlamaIndex generate `CREATE CUSTOM INDEX` statements using the SAI class name. Previously, ScyllaDB rejected these with "Non-supported custom class".

This PR adds compatibility so that SAI-style CQL statements work on ScyllaDB without modification.

1. **test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests**
   Enables the `SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS` Cassandra system property so that `search_beam_width` tests pass against Cassandra 5.0.7.

2. **test: modernize vector index test comments and fix xfail**
   Updates test comments from "Reproduces" to "Validates fix for" for clarity, and converts the `test_ann_query_with_pk_restriction` xfail into a stripped-down CREATE INDEX syntax test (removing unused INSERT/SELECT lines). Removes the redundant `test_ann_query_with_non_pk_restriction` test.

3. **cql: add Cassandra SAI (StorageAttachedIndex) compatibility**
   Core implementation: the SAI class name is detected and translated to ScyllaDB's native `vector_index`. The fully-qualified class name (`org.apache.cassandra.index.sai.StorageAttachedIndex`) requires exact case; short names (`StorageAttachedIndex`, `sai`) are matched case-insensitively — matching Cassandra's behavior. Non-vector and multi-column SAI targets are rejected with clear errors. Adds `skip_on_scylla_vnodes` fixture, SAI compatibility docs, and the Cassandra compatibility table entry (split into "SAI general" vs "SAI for vector search").

4. **cql: accept source_model option for Cassandra SAI compatibility**
   The `source_model` option is a Cassandra SAI property used by Cassandra libraries (e.g., CassIO) to tag vector indexes with the name of the embedding model. ScyllaDB accepts it for compatibility but does not use it — the validator is a no-op lambda. The option is preserved in index metadata and returned in DESCRIBE INDEX output.

- `cql3/statements/create_index_statement.cc`: SAI class detection and rewriting logic
- `index/secondary_index_manager.cc`: case-insensitive class name lookup (lowercasing restored before `classes.find()`)
- `index/vector_index.cc`: `source_model` accepted as a valid option with no-op validator
- `docs/cql/secondary-indexes.rst`: SAI compatibility documentation with `source_model` table row
- `docs/using-scylla/cassandra-compatibility.rst`: SAI entry split into general (not supported) and vector search (supported)
- `test/cqlpy/conftest.py`: `scylla_with_tablets` renamed to `skip_on_scylla_vnodes`
- `test/cqlpy/test_vector_index.py`: SAI tests inlined (no constants), `check_bad_option()` helper for numeric validation, uppercase class name test, merged `source_model` tests with DESCRIBE check

| Backend            | Passed | Skipped | Failed |
|--------------------|--------|---------|--------|
| ScyllaDB (dev)     | 42     | 0       | 0      |
| Cassandra 5.0.7    | 16     | 26      | 0      |

None: new feature.

Fixes: SCYLLADB-239

Closes scylladb/scylladb#28645

* github.com:scylladb/scylladb:
  cql: accept source_model option and show options in DESCRIBE
  cql: add Cassandra SAI (StorageAttachedIndex) compatibility
  test: modernize vector index test comments and fix xfail
  test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests
2026-04-10 10:21:20 +03:00
Tomasz Grabiec
88bea5aaf3 cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.

Since f344bd0aaa, aggreagte queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
2026-04-10 02:12:48 +02:00
Tomasz Grabiec
dc95d26464 test: cluster: dtest: Fix double-counting of metrics
get_node_metrics() in test/cluster/dtest/tools/metrics.py used
re.search(metric_name, metric) to match Prometheus metric lines. The
metric name select_partition_range_scan is a substring of
select_partition_range_scan_no_bypass_cache. So when querying for
select_partition_range_scan, the regex matched both Prometheus lines:

scylla_cql_select_partition_range_scan{shard="0",...} 1
scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0",...} 1

And because the code does metrics_res[metric_name] += val, it summed
both values, making it look like the counter was incremented
by 2 when it was actually incremented by 1. The fix appends r"[\s{]"
to the regex so the metric name must be followed by { (labels) or
whitespace (value), preventing substring matches.
2026-04-10 02:12:48 +02:00
Avi Kivity
f67d0739d0 test: user_function_test: adjust Lua error message tests
Lua 5.5 changed the error message slightly ("?:-1" -> "?:?"). Relax
the error message tests to avoid this unimportant fragment.

Closes scylladb/scylladb#29414
2026-04-10 01:09:35 +03:00
Piotr Szymaniak
98d6edaa88 alternator: add comment explaining delta_mode::keys in add_stream_options()
Clarify that cdc::delta_mode is ignored by Alternator, so we use the
least expensive mode (keys) to reduce overhead.

Fixes scylladb/scylladb#24812

Closes scylladb/scylladb#29408
2026-04-10 01:07:21 +03:00
Michał Hudobski
c8b9fde828 auth: allow VECTOR_SEARCH_INDEXING permission to access system.tablets
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.

Fixes: VECTOR-605

Closes scylladb/scylladb#29397
2026-04-09 21:53:07 +03:00
Pavel Emelyanov
5ffd3ccc8e test_backup: Remove create_ks_and_cf helper Test
Remove the create_ks_and_cf() helper function and its now-unused import
of format_tuples(). All callers have been converted to use the new
async patterns with new_test_keyspace() and cql.run_async().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:40:13 +03:00
Pavel Emelyanov
2c81e54d6d test_backup: Replace create_ks_and_cf with async patterns Test
Replace all 6 calls to create_ks_and_cf() with new async patterns:
- Use new_test_keyspace() context manager for keyspace creation
- Use cql.run_async() for CREATE TABLE statement
- Use asyncio.gather() with cql.run_async() for data insertion

The test_restore_with_non_existing_sstable only needs the ks:table
structure to exist; it doesn't use the pre-populated data.

This change makes the code more explicit and maintains proper async
semantics throughout.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:39:56 +03:00
Pavel Emelyanov
66d9f6e042 test_backup: Add if-True blocks for indentation Test
Add if-True blocks to wrap code that uses create_ks_and_cf() in all 6
test functions. This is a mechanical change to set up the next step
where the helper will be replaced with new async patterns. All code
after the create_ks_and_cf() call until the end of each test is now
indented under the if-True block.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:35:21 +03:00
Szymon Wasik
573def7cd8 cql: accept source_model option and show options in DESCRIBE
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.

ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.

Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
2026-04-09 17:20:03 +02:00
Szymon Wasik
80a2e4a0ab cql: add Cassandra SAI (StorageAttachedIndex) compatibility
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.

When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
  - org.apache.cassandra.index.sai.StorageAttachedIndex
  - StorageAttachedIndex
  - sai

SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.

The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.

Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().

Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
2026-04-09 17:20:03 +02:00
Szymon Wasik
fa7edc627c test: modernize vector index test comments and fix xfail
- Change 'Reproduces' to 'Validates fix for' in test comments to
  reflect that the referenced issues are already fixed.
- Condense the VECTOR-179 comment to two lines.
- Replace the xfailed test_ann_query_with_restriction_works_only_on_pk
  with a focused test (test_ann_query_with_pk_restriction) that creates
  a vector index on a table with a PK column restriction, validating
  the VECTOR-374 fix.
2026-04-09 17:20:02 +02:00
Szymon Wasik
4eab050be4 test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests 2026-04-09 17:20:02 +02:00
Andrzej Jackowski
23c386a27f test: perf: add audit-unix-socket-path to perf-simple-query
To allow performance benchmarking with custom syslog sinks.

Example use case:

-- Audit + default syslog: ~100k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY"

```
110263.72 tps ( 66.1 allocs/op,  16.0 logallocs/op,  25.7 tasks/op,  254900 insns/op,  144796 cycles/op,        0 errors)
throughput:
	mean=   107137.48 standard-deviation=3142.98
	median= 106665.00 median-absolute-deviation=1786.03
	maximum=111435.19 minimum=97620.79
instructions_per_op:
	mean=   256311.36 standard-deviation=5037.13
	median= 256288.09 median-absolute-deviation=2223.08
	maximum=274220.89 minimum=248141.40
cpu_cycles_per_op:
	mean=   146443.47 standard-deviation=2844.19
	median= 146001.85 median-absolute-deviation=1514.82
	maximum=157177.54 minimum=142981.03
```

-- Audit + custom syslog: ~400k tps
socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30 --audit "syslog" --audit-keyspace "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path /tmp/audit-null.sock

```
404929.62 tps ( 65.9 allocs/op,  16.0 logallocs/op,  25.5 tasks/op,   77406 insns/op,   35559 cycles/op,        0 errors)
throughput:
	mean=   399868.39 standard-deviation=6232.88
	median= 401770.65 median-absolute-deviation=3859.09
	maximum=406126.79 minimum=383434.84
instructions_per_op:
	mean=   77481.26 standard-deviation=168.31
	median= 77405.54 median-absolute-deviation=84.33
	maximum=78081.46 minimum=77332.84
cpu_cycles_per_op:
	mean=   35871.32 standard-deviation=516.83
	median= 35699.70 median-absolute-deviation=251.15
	maximum=37454.86 minimum=35432.60
```

-- No audit: ~800k tps
taskset -c 0,2,4 ./build/release/scylla perf-simple-query --smp 3 --write --duration 30

```
808970.95 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.9 tasks/op,   49904 insns/op,   20471 cycles/op,        0 errors)
throughput:
	mean=   809065.31 standard-deviation=6222.39
	median= 810507.10 median-absolute-deviation=1827.99
	maximum=815213.41 minimum=782104.84
instructions_per_op:
	mean=   49905.50 standard-deviation=21.81
	median= 49900.12 median-absolute-deviation=7.72
	maximum=50010.97 minimum=49892.57
cpu_cycles_per_op:
	mean=   20429.00 standard-deviation=41.40
	median= 20425.18 median-absolute-deviation=29.11
	maximum=20530.74 minimum=20355.42
```

Closes scylladb/scylladb#29396
2026-04-09 16:00:41 +03:00
Anna Stuchlik
c6587c6a70 doc: Fix malformed markdown link in alternator network docs
Fixes https://github.com/scylladb/scylladb/issues/29400

Closes scylladb/scylladb#29402
2026-04-09 15:54:43 +03:00
Botond Dénes
5886d1841a Merge 'cmake: align CMake build system with configure.py and add comparison script' from Ernest Zaslavsky
Every time someone modifies the build system — adding a source file, changing a compilation flag, or wiring a new test — the change tends to land in only one of our two build systems (configure.py or CMake). Over time this causes three classes of problems:

1. **CMake stops compiling entirely.** Missing defines, wrong sanitizer flags, or misplaced subdirectory ordering cause hard build failures that are only discovered when someone tries to use CMake (e.g. for IDE integration).

2. **Missing build targets.** Tests or binaries present in configure.py are never added to CMake, so `cmake --build` silently skips them. This PR fixes several such cases (e.g. `symmetric_key_test`, `auth_cache_test`, `sstable_tablet_streaming`).

3. **Missing compilation units in targets.** A `.cc` file is added to a test binary in one system but not the other, causing link errors or silently omitted test coverage.

To fix the existing drift and prevent future divergence, this series:

**Adds a build-system comparison script**
(`scripts/compare_build_systems.py`) that configures both systems into a temporary directory, parses their generated `build.ninja` files, and compares per-file compilation flags, link target sets, and per-target libraries. configure.py is treated as the baseline; CMake must match it. The script supports a `--ci` mode suitable for gating PRs that touch
build files.

**Fixes all current mismatches** found by the script:
- Mode flag alignment in `mode.common.cmake` and `mode.Coverage.cmake`
  (sanitizer flags, `-fno-lto`, stack-usage warnings, coverage defines).
- Global define alignment (`SEASTAR_NO_EXCEPTION_HACK`, `XXH_PRIVATE_API`,
  `BOOST_ALL_DYN_LINK`, `SEASTAR_TESTING_MAIN` placement).
- Seastar build configuration (shared vs static per mode, coverage
  sanitizer link options).
- Abseil sanitizer flags (`-fno-sanitize=vptr`).
- Missing test targets in `test/boost/CMakeLists.txt`.
- Redundant per-test flags now covered by global settings.
- Lua library resolution via a custom `cmake/FindLua.cmake` using
  pkg-config, matching configure.py's approach.

**Adds documentation** (`docs/dev/compare-build-systems.md`) describing how to run the script and interpret its output.

No backport needed — this is build infrastructure improvement only.

Closes scylladb/scylladb#29273

* github.com:scylladb/scylladb:
  scripts: remove lua library rename workaround from comparison script
  cmake: add custom FindLua using pkg-config to match configure.py
  test/cmake: add missing tests to boost test suite
  test/cmake: remove per-test LTO disable
  cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
  cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
  cmake: add -fno-sanitize=vptr for abseil sanitizer flags
  cmake: align Seastar build configuration with configure.py
  cmake: align global compile defines and options with configure.py
  cmake: fix Coverage mode in mode.Coverage.cmake
  cmake: align mode.common.cmake flags with configure.py
  configure.py: add sstable_tablet_streaming to combined_tests
  docs: add compare-build-systems.md
  scripts: add compare_build_systems.py to compare ninja build files
2026-04-09 15:46:09 +03:00
Yaniv Michael Kaul
13879b023f tracing: set_skip_when_empty() for error-path metrics
Add .set_skip_when_empty() to all error-path metrics in the tracing
module. Tracing itself is not a commonly used feature, making all of
these metrics almost always zero:

Tier 1 (very rare - corruption/schema issues):
- tracing_keyspace_helper::bad_column_family_errors: tracing schema
  missing or incompatible, should never happen post-bootstrap
- tracing::trace_errors: internal error building trace parameters

Tier 2 (overload - tracing backend saturated):
- tracing::dropped_sessions: too many pending sessions
- tracing::dropped_records: too many pending records

Tier 3 (general tracing write errors):
- tracing_keyspace_helper::tracing_errors: errors during writes to
  system_traces keyspace

Since tracing is an opt-in feature that most deployments rarely use,
all five metrics are almost always zero and create unnecessary
reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29346
2026-04-09 14:28:16 +03:00
Michael Litvak
3964040008 docs/dev: add counters doc
Add a documentation of the counters feature implementation in
docs/dev/counters.md.

The documentation is taken from the wiki and updated according to the
current state of the code - legacy details are removed, and a section
about the counter id is added.
2026-04-09 13:08:02 +02:00
Michael Litvak
b71762d5da counters: reuse counter IDs by rack
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.

A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.

We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.

This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.

Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).

Fixes SCYLLADB-356
2026-04-09 13:08:02 +02:00
Yaniv Michael Kaul
2c0076d3ef replica: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:

- database::dropped_view_updates: view updates dropped due to overload.
  NOTE: this metric appears to never be incremented in the current
  codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
  badness counter' that should always be zero. NOTE: no increment site
  was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
  badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
  fires when disk utilization is critical and user table writes are
  disabled, a very rare operational state.

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29345
2026-04-09 14:07:28 +03:00
Botond Dénes
86417d49de Merge 'transport: improve memory accounting for big responses and slow network' from Marcin Maliszkiewicz
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.

Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of memory permit. This is especially important with slow networks and big responses when buffers can accumulate and deplete a node's memory.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740

Backport: all supported versions

Closes scylladb/scylladb#29288

* github.com:scylladb/scylladb:
  transport: add per-service-level pending response memory metric
  transport: hold memory permit until response write completes
  transport: account for response size exceeding initial memory estimate
2026-04-09 13:36:31 +03:00
Yaniv Michael Kaul
5c8b4a003e db: set_skip_when_empty() for rare error-path metrics
Add .set_skip_when_empty() to four metrics in the db module that are
only incremented on very rare error paths and are almost always zero:

- cache::pinned_dirty_memory_overload: described as 'should sit
  constantly at 0, nonzero is indicative of a bug'
- corrupt_data::entries_reported: only fires on actual data corruption
- hints::corrupted_files: only fires on on-disk hint file corruption
- rate_limiter::failed_allocations: only fires when the rate limiter
  hash table is completely full and gives up allocating, requiring
  extreme cardinality pressure

These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29344
2026-04-09 13:32:09 +03:00
Gleb Natapov
dbaba7ab8a storage_service: cleanup unused code
Remove unused definition and double includes.
2026-04-09 13:31:41 +03:00
Gleb Natapov
b050b593b3 storage_service: simplify get_peer_info_for_update
It does nothing for fields managed in raft, so drop their processing.
2026-04-09 13:31:41 +03:00
Gleb Natapov
d0576c109f gossiper: send shutdown notifications in parallel 2026-04-09 13:31:40 +03:00
Gleb Natapov
1586fa65af gms: remove unused code
Also moved version_string(...) and make_token_string(...) to private: — they are internal helpers used only by normal(), not part of the public API
2026-04-09 13:31:40 +03:00
Gleb Natapov
b2e35c538f virtual_tables: no need to call gossiper if we already know that the node is in shutdown 2026-04-09 13:31:40 +03:00
Gleb Natapov
e17fc180a0 gossiper: print node state from raft topology in the logs
Raft topology has real node's state now. gossiper sate are now set to
NORMAL and SHUTDOWN only.
2026-04-09 13:31:40 +03:00
Gleb Natapov
8439154851 gossiper: use is_shutdown instead of code it manually 2026-04-09 13:31:39 +03:00
Gleb Natapov
7d700d0377 gossiper: mark endpoint_state(inet_address ip) constructor as explicit
get_live_members function called is_shutdown which inet_address
argument, which caused temporary endpoint_state to be created. Fix
it by prohibiting implicit conversion and calling the correct
is_shutdown function instead.
2026-04-09 13:31:39 +03:00
Gleb Natapov
6df4f572d5 gossiper: remove unused code 2026-04-09 13:31:39 +03:00
Gleb Natapov
67102496c8 gossiper: drop last use of LEFT state and drop the state
The decommission sets left gossiper state only to prevent shutdown
notification be issued by the node during shutdown. Since the
notification code now checks the state in raft topology this is no
longer needed.
2026-04-09 13:31:39 +03:00
Gleb Natapov
54d2c95094 gossiper: drop unused STATUS_BOOTSTRAPPING state 2026-04-09 13:31:38 +03:00
Gleb Natapov
7c895ced19 gossiper: rename is_dead_state to is_left since this is all that the function checks now. 2026-04-09 13:31:38 +03:00
Gleb Natapov
7dfb0577b8 gossiper: use raft topology state instead of gossiper one when checking node's state
Raft topology state is a truth source for the nodes state, so use it
instead of a gossiper one.
2026-04-09 13:31:38 +03:00
Gleb Natapov
c17c4806a1 storage_service: drop check_for_endpoint_collision function
All the checks that it does are also done by join coordinator and the
join coordinator uses more reliable raft state instead of gossiper one.
2026-04-09 13:31:37 +03:00
Gleb Natapov
1ac8edb22b storage_service: drop is_first_node function
It make no sense now since the first node to bootstrap is determined by
discover_group0 algorithm.
2026-04-09 13:31:37 +03:00
Gleb Natapov
681aa9ebe1 gossiper: remove unused REMOVED_TOKEN state 2026-04-09 13:31:37 +03:00
Gleb Natapov
5af17aa578 gossiper: remove unused advertise_token_removed function 2026-04-09 13:31:36 +03:00
Dawid Mędrek
f0dfe29d88 service: strong_consistency: Abort state_machine::apply when aborting server
The state machine used by strongly consistent tablets may block on a
read barrier if the local schema is insufficient to resolve pending
mutations [1]. To deal with that, we perform a read barrier that may
block for a long time.

When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.

---

In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` and not doing anything.
That's a red flag considering that it may lead to the instance
being killed on the spot [2].

Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of an exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.

---

Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:

1. Prevent scheduling adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.

Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.

That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].

---

We provide two validation tests. They simulate the abortion of
`state_machine::apply` in two different scenarios:

* when the table is dropped (which should also cover the case of tablet
  migration),
* when the node is shutting down.

The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.

Unfortunately, both tests are marked as skipped becuause of the current
limitations of `raft::server_impl::abort` described above and in [4].

References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056
2026-04-09 11:36:51 +02:00
Dawid Mędrek
ad8a263683 service: strong_consistency: Abort ongoing operations when shutting down
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.

When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.

We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).

---

Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:

```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
    ss.local().drain_on_shutdown().get();
});
```

The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.

It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.

---

Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

After:
```
real    0m32.024s
user    1m10.751s
sys     0m23.871s
```

Individual runs of the added tests:

test_queries_when_shutting_down:
```
real    0m12.818s
user    0m36.726s
sys     0m4.577s
```

test_abort_forwarded_write_upon_shutdown:
```
real    0m12.930s
user    0m36.622s
sys     0m4.752s
```
2026-04-09 11:36:17 +02:00
Dawid Mędrek
4a87bdc778 service: client_state: Extend with abort_source
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.

We set the abort source for client states in relevant places.
2026-04-09 11:35:35 +02:00
Dawid Mędrek
89c049b889 service: strong_consistency: Handle abort when removing Raft group
When a strongly consistent Raft group is being removed, it means one of
the following cases:

(A) The node is shutting down and it's simply part of the the shutdown
    procedure.

(B) The tablet is somehow leaving the replica. For example, due to:
    - Tablet migration
    - Tablet split/merge
    - Tablet removal (e.g. because the table is dropped)

In this commit, we focus on case (A). Case (B) will be handled in the
following one.

---

The changes in the code are literally none, and there's a reason to it.

First, let's note that we've already implemented abortion of timed-out
requests. There is a limit to how long a query can run and sooner or
later it will finish, regardless of what we do.

Second, we need to ask ourselves if the cases we're considering in this
commit (i.e. case (B)) is a situation where we'd like to speed up the
process. The answer is no.

Tablet migrations are effectively internal operations that are invisible
to the users. User requests are, quite obviously, the opposite of that.
Because of that, we want to patiently wait for the queries to finish or
time out, even though it's technically possible to lead to an abort
earlier.

Lastly, the changes in the code that actually appear in this commit are
not completely irrelevant either. We consider the important case of
the `leader_info_updater` fiber and argue that it's safe to not pass
any abort source to the Raft methods used by it.

---

Unfortunately, we don't have tablet migrations implemented yet [1],
so our testing capabilities are limited. Still, we provide a new test
that corresponds to case (B) described above. We simulate a tablet
migration by dropping a table and observe how reads and writes behave
in such a situation. There's no extremely careful validation involved
there, but that's what we can have for the time being.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

After:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

The time spent on the new test only:
```
real    0m5.264s
user    0m34.646s
sys     0m3.374s
```

References:
[1] SCYLLADB-868
2026-04-09 11:35:31 +02:00
Dawid Mędrek
7dcc3e85b9 service: strong_consistency: Abort Raft operations on timeout
If a query, either a write, or a read to a strongly consistent table,
times out, we immediately abort the operation and throw an exception.

Unfortunately, due to the inconsistency in exception types thrown
on timeout by the many methods we use in the code, it results in
pretty messy `try-catch` clauses. Perhaps there's a better alternative
to this, but it's beyond the scope of this work, so we leave it as-is.

We provide a validation test that consists of three cases corresponding
to reads, writes, and waiting for the leader. They verify that the code
works as expected in all affected places.

A comparison of time spent on the whole `test_strong_consistency.py` on
my local machine, in dev mode:

Before:
```
real    0m32.185s
user    0m55.391s
sys     0m15.745s
```

After:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

The time spent on the new test only:
```
real  0m7.077s
user  0m35.359s
sys   0m3.717s
```
2026-04-09 11:35:04 +02:00
Piotr Szymaniak
65a1bdd368 docs: document Alternator auditing in the operator-facing auditing guide
- Document Alternator (DynamoDB-compatible API) auditing support in
  the operator-facing auditing guide (docs/operating-scylla/security/auditing.rst)
- Cover operation-to-category mapping, operation field format,
  keyspace/table filtering, and audit log examples
- Document the audit_tables=alternator.<table> shorthand format
- Minor wording improvements throughout (Scylla -> ScyllaDB,
  clarify default audit backend)

Closes scylladb/scylladb#29231
2026-04-09 12:26:57 +03:00
Dawid Mędrek
2243e0ffea service: strong_consistency: Use timeout when mutating
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.

Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
fd9c907be1 service: strong_consistency: Fix indentation 2026-04-09 11:25:57 +02:00
Dawid Mędrek
ca7f24516e service: strong_consistency: Enclose coordinator methods with try-catch
We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They
do nothing at the moment, but we'll use them later. We do this now to
avoid noise in the upcoming commits.

We'll fix the indentation in the following commit.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
e9ea9e7259 service: strong_consistency: Crash at unexpected exception
The loop shouldn't throw any other exception than the ones already
covered by the `catch` claues. Crash, at least when
`abort_on_internal_error` is set, if we catch any other type since
that may be a sign of a bug.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
f499a629ab test: cluster: Extract default config & cmdline in test_strong_consistency.py
All used configs and cmdlines share the same values. Let's extract them
to avoid repeating them every time a new test is written. Those options
should be enabled for all tests in the file anyway.
2026-04-09 11:25:57 +02:00
Geoff Montee
7d7ec7025e docs: Document system keyspaces for developers / internal usage
Fixes #29043 with the following docs changes:

- docs/dev/system-keyspaces.md: Added a new file that documents all keyspaces created internally

Closes scylladb/scylladb#29044
2026-04-09 11:49:58 +03:00
Guy Shtub
40a861016a docs/faq.rst: Fixing small spelling mistake
Closes scylladb/scylladb#29131
2026-04-09 11:48:46 +03:00
Pavel Emelyanov
78f5bab7cf table: Add formatter for group_id argument in tablet merge exception message
Fixes: SCYLLADB-1432

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29143
2026-04-09 11:45:57 +03:00
Botond Dénes
fbbe2bdce8 Merge 'Introduce repair_service::config and cut dependency from db::config' from Pavel Emelyanov
Spreading db::config around and making all services depend on it is not nice. Most other service that need configuration provide their own config that's populated from db::config in main.cc/cql_test_env.cc and use it, not the global config.

This PR does the same for repair_service.

Enhancing components dependencies, not backporting

Closes scylladb/scylladb#29153

* github.com:scylladb/scylladb:
  repair: Remove db/config.hh from repair/*.cc files
  repair: Move repair_multishard_reader options onto repair_service::config
  repair: Move critical_disk_utilization_level onto repair_service::config
  repair: Move repair_partition_count_estimation_ratio onto repair_service::config
  repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config
  repair: Move enable_small_table_optimization_for_rbno onto repair_service::config
  repair: Introduce repair_service::config
2026-04-09 11:44:25 +03:00
Botond Dénes
76c8794f4f Merge 'Strong consistency: allow taking snapshots (but not transfer) and make them less likely' from Piotr Dulikowski
While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely.

While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking:

- The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log before which a snapshot is created from 2MB to 10MB.
- The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots.

I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally.

Refs: SCYLLADB-1115

Strong consistency is experimental, no need for backport.

Closes scylladb/scylladb#29189

* github.com:scylladb/scylladb:
  strong_consistency: fake taking and dropping snapshots
  strong_consistency: adjust limits for snapshots
2026-04-09 11:44:03 +03:00
Anna Stuchlik
dd34d2afb4 doc: remove references to old versions from Docker Hub docs
This commit removes references ScyllaDB versions ("Since x.y")
from the ScyllaDB documentation on Docker Hub, as they are
redundant and confusing (some versions are super ancient).

Fixes SCYLLADB-1212

Closes scylladb/scylladb#29204
2026-04-09 11:43:40 +03:00
Botond Dénes
c162277b28 Merge 'Perform full connection set-up for CertificateAuthorization in process_startup()' from Pavel Emelyanov
The code responds ealry with READY message, but lack some necessary set up, namely:

* update_scheduling_group(): without it, the connection runs under the default scheduling group instead of the one mapped to the user's service level.

* on_connection_ready(): without it, the connection never releases its slot in the uninitialized-connections concurrency semaphore (acquired at connection creation), leaking one unit per cert-authenticated connection for the lifetime of the connection.

* _authenticating = false / _ready = true: without them, system.clients reports connection_stage = AUTHENTICATING forever instead of READY (not critical, but not nice either)

The PR fixes it and adds a regression test, that (for sanity) also covers AllowAll and Password authrticators

Fixes SCYLLADB-1226

Present since 2025.1, probably worth backporting

Closes scylladb/scylladb#29220

* github.com:scylladb/scylladb:
  transport: fix process_startup cert-auth path missing connection-ready setup
  transport: test that connection_stage is READY after auth via all process_startup paths
2026-04-09 11:43:02 +03:00
Raphael S. Carvalho
16e387d5f9 repair/replica: Fix race window where post-repair data is wrongly promoted to repaired
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED.  The repair lifecycle is:

  1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
  2. Row-level repair streams missing rows between replicas.
  3. mark_sstable_as_repaired() runs on all replicas, rewriting the
     SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
  4. The coordinator atomically commits sstables_repaired_at=N+1 and
     the end_repair stage to Raft, then broadcasts
     repair_update_compaction_ctrl which calls clear_being_repaired().

The bug lives in the window between steps 3 and 4.  After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N.  The classifier therefore sees:

  is_repaired(N, sst{repaired_at=N+1}) == false
  sst->being_repaired == null   (lost on restart, or not yet set)

and puts them in the UNREPAIRED view.  If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables.  The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.

Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan.  Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED.  This
divergence is never healed by future repairs because the repaired set is
considered authoritative.  The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.

The fix has two layers:

Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes.  This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.

Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory.  It checks:

  sst.repaired_at == sstables_repaired_at + 1
  AND tablet transition kind == tablet_transition_kind::repair

Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft.  Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.

The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.

A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
  - Running repair round 1 to establish sstables_repaired_at=1.
  - Injecting delay_end_repair_update to hold the race window open.
  - Running repair round 2 so all replicas complete mark_sstable_as_repaired
    (repaired_at=2) but the coordinator has not yet committed step 4.
  - Writing post-repair keys to all replicas and flushing servers[1] to
    create an SSTable with repaired_at=0 on disk.
  - Restarting servers[1] so being_repaired is lost from memory.
  - Waiting for autocompaction to merge the two SSTables on servers[1].
  - Asserting that the merged SSTable contains post-repair keys (the bug)
    and that servers[0] and servers[2] do not see those keys as repaired.

NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check.  I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29244
2026-04-09 11:42:28 +03:00
Dawid Mędrek
a8bc90a375 Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron
This series fixes two related inconsistencies around secondary-index
names.
1. `DESCRIBE INDEX ... WITH INTERNALS` returned the backing
   materialized-view name in the `name` column instead of the logical
   index name.
2. The snapshot REST API accepted backing table names for MV-backed
   secondary indexes, but not the logical index names exposed to users.

The snapshot side now resolves logical secondary-index names to backing
table names where applicable, reports logical index names in snapshot
details, rejects vector index names with HTTP 400, and keeps multi-keyspace
DELETE atomic by resolving all keyspaces before deleting anything.
The tests were also extended accordingly, and the snapshot test helper
was fixed to clean up multi-table snapshots using one DELETE per table.

Fixes: SCYLLADB-1122

Minor bugfix, no need to backport.

Closes scylladb/scylladb#29083

* github.com:scylladb/scylladb:
  cql3: fix DESCRIBE INDEX WITH INTERNALS name
  test: add snapshot REST API tests for logical index names
  test: fix snapshot cleanup helper
  api: clarify snapshot REST parameter descriptions
  api: surface no_such_column_family as HTTP 400
  db: fix clear_snapshot() atomicity and use C++23 lambda form
  db: normalize index names in get_snapshot_details()
  db: add resolve_table_name() to snapshot_ctl
2026-04-09 08:37:51 +03:00
Piotr Dulikowski
ec0231c36c Merge 'db/view/view_building_worker: lock staging sstables mutex for all necessary shards when creating tasks' from Michał Jadwiszczak
To create `process_staging` view building tasks, we firstly need to collect informations about them on shard0, create necessary mutations, commit them to group0 and move staging sstables objects to their original shards.

But there is a possible race after committing the group0 command and before moving the staging sstables to their shards. Between those two events, the coordinator may schedule freshly created tasks and dispatch them to the worker but the worker won't have the sstables objects because they weren't moved yet.

This patch fixes the race by holding `_staging_sstables_mutex` locks from all necessary shards when executing `create_staging_sstable_tasks()`. With this, even if the task will be scheduled and dispatched quickly, the worker will wait with executing it until the sstables objects are moved and the locks are released.

Fixes SCYLLADB-816

This PR should be backported to all versions containing view building coordinator (2025.4 and newer).

Closes scylladb/scylladb#29174

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indentation
  db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks
2026-04-09 08:37:51 +03:00
Piotr Smaron
ecc3bcabd4 test/ldap: add LDAP filter-injection reproducers
Add tests that reproduce LDAP filter injection via unescaped {USER}
substitution (SCYLLADB-1309).  A wildcard username ('*') matches
every group entry, and a parenthesis payload (")(uid=*") breaks the
search filter.

Extend the LDAP test fixture (ldap_server.py, slapd.conf) with
memberUid attributes and the NIS schema so the new tests can
exercise direct filter-value substitution.
2026-04-08 13:53:49 +02:00
Piotr Smaron
d458ff50b0 cql3: fix DESCRIBE INDEX WITH INTERNALS name
DESCRIBE INDEX ... WITH INTERNALS returned the name of
the backing materialized view in the name column instead
of the logical index name.

Return the logical index name from schema::describe()
for index schemas so all callers observe the
user-facing name consistently.

Fixes: SCYLLADB-1122
2026-04-08 13:38:17 +02:00
Piotr Smaron
04837ba20f test: add snapshot REST API tests for logical index names
Add focused REST coverage for logical secondary-index
names in snapshot creation, deletion, and details
output.

Also cover vector-index rejection and verify
multi-keyspace delete resolves all keyspaces before
deleting anything so mixed index kinds cannot cause
partial removal.
2026-04-08 13:38:17 +02:00
Piotr Smaron
6b85da3ce3 test: fix snapshot cleanup helper
The snapshot REST helper cleaned up multi-table
snapshots with a single DELETE request that passed a
comma-separated cf filter, but the API accepts only one
table name there.

Delete each table snapshot separately so existing tests
that snapshot multiple tables use the API as
documented.
2026-04-08 13:36:27 +02:00
Piotr Smaron
3090684dad api: clarify snapshot REST parameter descriptions
Document the current /storage_service/snapshots behavior
more accurately.

For DELETE, cf is a table filter applied independently
in each keyspace listed in kn. If cf is omitted or
empty, snapshots for all tables are eligible, and
secondary indexes can be addressed by their logical
index name.
2026-04-08 13:36:27 +02:00
Piotr Smaron
6ee75c74bd api: surface no_such_column_family as HTTP 400
Snapshot requests that name a non-existent table or a
non-snapshotable logical index currently surface an
internal server error.

Translate no_such_column_family into a bad request so
callers get a client-facing error that matches the
invalid input.
2026-04-08 13:36:27 +02:00
Piotr Smaron
7d83a264ac db: fix clear_snapshot() atomicity and use C++23 lambda form
clear_snapshot() applies a table filter independently in
each keyspace, so logical index names must be resolved
per keyspace on the delete path as well.

Resolve all keyspaces before deleting anything so a later
failure cannot partially remove a snapshot, and use the
explicit-object-parameter coroutine lambda form for the
asynchronous implementation.
2026-04-08 13:36:27 +02:00
Piotr Smaron
39baa1870e db: normalize index names in get_snapshot_details()
Snapshot details exposed backing secondary-index view
names instead of logical index names.

Normalize index entries in get_snapshot_details() so the
REST API reports the user-facing name, and update the
existing REST test to assert that behavior directly.
2026-04-08 13:36:27 +02:00
Piotr Smaron
9c37f1def2 db: add resolve_table_name() to snapshot_ctl
The snapshot REST API accepted backing secondary-index
table names, but not logical index names.

Introduce resolve_table_name() so snapshot creation can
translate a logical index name to the backing table when
the index is materialized as a view.
2026-04-08 13:36:27 +02:00
Petr Gusev
7750d5737c strong consistency: replace local consistency with global
Currently we don't support 'local' consistency, which would
imply maintaining separate raft group for each dc. What we
support is actually 'global' consistency -- one raft group
per tablet replica set. We don't plan to support local
consistency for the first GA.

Closes scylladb/scylladb#29221
2026-04-08 12:52:32 +02:00
Patryk Jędrzejczak
850db950f8 Merge 'raft: include demoted voters in read barrier during joint config' from Qian Cheng
Hi, thanks for Scylla!

We found a small issue in tracker::set_configuration() during joint consensus and put together a fix.

When a server is demoted from voter to non-voter, set_configuration processes the current config first (can_vote=false), then the previous config. But when it finds the server already in the progress map (tracker.cc:118), it hits `continue` without updating can_vote. So the server's follower_progress::can_vote stays false even though it's still a voter in the previous config.

This causes broadcast_read_quorum (fsm.cc:1055) to skip the demoted server, reducing the pool of responders. Since committed() correctly includes the server in _previous_voters for quorum calculation, read barriers can stall if other servers are slow.

The fix is to use configuration::can_vote() in tracker::set_configuration.

We included a reproduction unit test (test_tracker_voter_demotion_joint_config) that extracts the set_configuration algorithm and demonstrates the mismatch. We weren't able to build the full Scylla test suite to add an in-tree test, so we kept it as a standalone file for reference.

No backport: the bug is non-critical and the change needs some soak time in master.

Closes scylladb/scylladb#29226

* https://github.com/scylladb/scylladb:
  fix: use is_voter::yes instead of true in test assertions
  test: add tracker voter demotion test to fsm_test.cc
  fix: use configuration::can_vote() in tracker::set_configuration
2026-04-08 12:37:27 +02:00
Qian-Cheng-nju
a416238155 test: add tracker voter demotion test to fsm_test.cc 2026-04-08 12:37:19 +02:00
Qian-Cheng-nju
f72528c759 raft: use configuration::can_vote() in tracker::set_configuration 2026-04-08 12:37:16 +02:00
Michał Jadwiszczak
568f20396a test: fix flaky test_create_index_synchronous_updates trace event race
The test_create_index_synchronous_updates test in test_secondary_index_properties.py
was intermittently failing with 'assert found_wanted_trace' because the expected
trace event 'Forcing ... view update to be synchronous' was missing from the
trace events returned by get_query_trace().

Root cause: trace events are written asynchronously to system_traces.events.
The Python driver's populate() method considers a trace complete once the
session row in system_traces.sessions has duration IS NOT NULL, then reads
events exactly once. Since the session row and event rows are written as
separate mutations with no transactional guarantee, the driver can read an
incomplete set of events.

Evidence from the failed CI run logs:
- The entire test (CREATE TABLE through DROP TABLE) completed in ~300ms
  (01:38:54,859 - 01:38:55,157)
- The INSERT with tracing happened in a ~50ms window between the second
  CREATE INDEX completing (01:38:55,108) and DROP TABLE starting
  (01:38:55,157)
- The 'Forcing ... synchronous' trace message is generated during the
  INSERT write path (db/view/view.cc:2061), so it was produced, but
  not yet flushed to system_traces.events when the driver read them
- This matches the known limitation documented in test/alternator/
  test_tracing.py: 'we have no way to know whether the tracing events
  returned is the entire trace'

Fix: replace the single-shot trace.events read with a retry loop that
directly queries system_traces.events until the expected event appears
(with a 30s timeout). Use ConsistencyLevel.ONE since system_traces has
RF=2 and cqlpy tests run on a single-node cluster.

The same race condition pattern exists in test_mv_synchronous_updates in
test_materialized_view.py (which this test was modeled after), so the
same fix is proactively applied there as well.

Fixes SCYLLADB-1314

Closes scylladb/scylladb#29374
2026-04-08 12:35:10 +02:00
Raphael S. Carvalho
f941a77867 scripts/base36-uuid: dump date in UTC
Previously, the timestamp decoded from a timeuuid was printed using the
local timezone via datetime.fromtimestamp(), which produces different
output depending on the machine's locale settings.

ScyllaDB logs are emitted in UTC by default. Printing the decoded date
in UTC makes it straightforward to correlate SSTable identifiers with
log entries without having to mentally convert timezones.

Also fix the embedded pytest assertion, which was accidentally correct
only on machines in UTC+8 — it now uses an explicit UTC-aware datetime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29253
2026-04-08 12:19:55 +03:00
Yaniv Michael Kaul
c385c0bdf9 .github/workflows/call_validate_pr_author_email.yml: add missing workflow permissions
Add explicit permissions block (contents: read, pull-requests: write,
statuses: write) matching the requirements of the called reusable
workflow which checks out code, posts PR comments, and sets commit
statuses. Fixes code scanning alert #172.

Closes scylladb/scylladb#29183
2026-04-08 12:19:55 +03:00
Pavel Emelyanov
788ecaa682 api: Fix enable_injection to accept case-insensitive bool parameter
Replace strict case-sensitive '== "True"' check with strcasecmp(..., "true")
so that Python's str(True) -> "True" is properly recognized. Accepts any
case variation of "true" ("True", "TRUE", etc.), with empty string
defaulting to false.

Maintains backward compatibility with out-of-tree tests that rely on
Python's bool stringification.

The goal is to reduce the number of distinct ways API handlers use to
convert string http query parameters into bool variables. This place is the
only one that simply compares param to "True".

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29236
2026-04-08 12:19:55 +03:00
Avi Kivity
0fd9ea9701 abseil: update to lts_2026_01_07
Switch to branch lts_2026_01_07, which is exactly equal to
upstream now.

There were no notable changes in the release notes, but the
new versions are more friendly to newer compilers (specifically,
in include hygiene).

configure.py needs a few library updates; cmake works without
change.

scylla-gdb.py updated for new hash table layout (by Claude Opus 4.6).

* abseil d7aaad83...255c84da (1179):
  > Abseil LTS branch, Jan 2026, Patch 1 (#2007)
  > Cherry-picks for LTS 20260107 (#1990)
  > Apply LTS transformations for 20260107 LTS branch (#1989)
  > Mark legacy Mutex methods and MutexLock pointer constructors as deprecated
  > `cleanup`: specify that it's safe to use the class in a signal handler.
  > Suppress bugprone-use-after-move in benign cases
  > StrFormat: format scientific notation without heap allocation
  > Introduce a legacy copy of GetDebugStackTraceHook API.
  > Report 1ns instead of 0ns for probe_benchmarks. Some tools incorrectly assume that benchmark was not run if 0ns reported.
  > Add absl::chunked_queue
  > `CRC32` version of `CombineContiguous` for length <= 32.
  > Add `absl::down_cast`
  > Fix FixedArray iterator constructor, which should require input_iterator, not forward_iterator
  > Add a latency benchmark for hashing a pair of integers.
  > Delete absl::strings_internal::STLStringReserveAmortized()
  > As IsAtLeastInputIterator helper
  > Use StringAppendAndOverwrite() in CEscapeAndAppendInternal()
  > Add support for absl::(u)int128 in FastIntToBuffer()
  > absl/strings: Prepare helper for printing objects to string representations.
  > Use SimpleAtob() for parsing bool flags
  > No-op changes to relative timeout support code.
  > Adjust visibility of heterogeneous_lookup_testing.h
  > Remove -DUNORDERED_SET_CXX17 since the macro no longer exists
  > [log] Prepare helper for streaming container contents to strings.
  > Restrict the visibility of some internal testing utilities
  > Add absl::linked_hash_set and absl::linked_hash_map
  > [meta] Add constexpr testing helper.
  > BUILD file reformatting.
  > `absl/meta`: Add C++17 port of C++20 `requires` expression for internal use
  > Remove the implementation of `absl::string_view`, which was only needed prior to C++17. `absl::string_view` is now an alias for `std::string_view`. It is recommended that clients simply use `std::string_view`.
  > No public description
  > absl:🎏 Stop echoing file content in flagfile parsing errors Modified ArgsList::ReadFromFlagfile to redact the content of unexpected lines from error messages. \
  > Refactor the declaration of `raw_hash_set`/`btree` to omit default template parameters from the subclasses.
  > Import of CCTZ from GitHub.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Flag help generator
  > Correct `Mix4x16Vectors` comment.
  > Special implementation for string hash with sizes greater than 64.
  > Reorder function parameters so that hash state is the first argument.
  > Search more aggressively for open slots in absl::internal_stacktrace::BorrowedFixupBuffer
  > Implement SpinLockHolder in terms of std::lock_guard.
  > No public description
  > Avoid discarding test matchers.
  > Import of CCTZ from GitHub.
  > Automated rollback of commit 9f40d6d6f3cfc1fb0325dd8637eb65f8299a4b00.
  > Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
  > Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
  > Make AnyInvocable remember more information
  > Add further diagnostics under clang for string_view(nullptr)
  > Import of CCTZ from GitHub.
  > Document the differing trimming behavior of absl::Span::subspan() and std::span::subspan()
  > Special implementation for string hash with sizes in range [33, 64].
  > Add the deleted string_view(std::nullptr_t) constructor from C++23
  > CI: Use a cached copy of GoogleTest in CMake builds if possible to minimize the possibility of errors downloading from GitHub
  > CI: Enable libc++ hardening in the ASAN build for even more checks https://libcxx.llvm.org/Hardening.html
  > Call the common case of AllocateBackingArray directly instead of through the function pointer.
  > Change AlignedType to have a void* array member so that swisstable backing arrays end up in the pointer-containing partition for heap partitioning.
  > base: Discourage use of ABSL_ATTRIBUTE_PACKED
  > Revert: Add an attribute to HashtablezInfo which performs a bitwise XOR on all hashes. The purposes of this attribute is to identify if identical hash tables are being created. If we see a large number of identical tables, it's likely the code can be improved by using a common table as opposed to keep rebuilding the same one.
  > Import of CCTZ from GitHub.
  > Record insert misses in hashtable profiling.
  > Add absl::StatusCodeToStringView.
  > Add a missing dependency on str_format that was being pulled in transitively
  > Pico-optimize `SkipWhitespace` to use `StripLeadingAsciiWhitespace`.
  > absl::string_view: Upgrade the debug assert on the single argument char* constructor to ABSL_HARDENING_ASSERT
  > Use non-stack storage for stack trace buffers
  > Fixed incorrect include for ABSL_NAMESPACE_BEGIN
  > Add ABSL_REFACTOR_INLINE to separate the inliner directive from the deprecated directive so that we can give users a custom deprecation message.
  > Reduce stack usage when unwinding without fixups
  > Reduce stack usage when unwinding from 170 to 128 on x64
  > Rename RecordInsert -> RecordInsertMiss.
  > PR #1968: Use std::move_backward within InlinedVector's Storage::Insert
  > Use the new absl::StringResizeAndOverwrite() in CUnescape()
  > Explicitly instantiate common `raw_hash_set` backing array functions.
  > Rollback reduction of maximum load factor. Now it is back to 28/32.
  > Export Mutex::Dtor from shared libraries in NDEBUG mode
  > Allow `IsOkAndHolds` to rely on duck typing for matching `StatusOr` like types instead of uniquely `absl::StatusOr`, e.g. `google::cloud::StatusOr`.
  > Fix typo in macro and add missing static_cast for WASM builds.
  > windows(cmake): add abseil_test_dll to target link libraries when required
  > Handle empty strings in `SimpleAtof` after stripping whitespace
  > Avoid using a thread_local in an inline function since this causes issues on some platforms.
  > (Roll forward) Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
  > Fixes for String{Resize|Append}AndOverwrite   - StringAppendAndOverwrite() should always call StringResizeAndOverwrite()     with at least capacity() in case the standard library decides to shrink     the buffer (Fixes #1965)   - Small refactor to make the minimum growth an addition for clarity and     to make it easier to test 1.5x growth in the future   - Turn an ABSL_HARDENING_ASSERT into a ThrowStdLengthError   - Add a missing std::move
  > Correct the supported features of Status Matchers
  > absl/time: Use "memory order acquire" for loads, which would allow for the safe removal of the data memory barrier.
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
  > Release `ABSL_EXPECT_OK` and `ABSL_ASSERT_OK`.
  > Fix the CHECK_XX family of macros to not print `char*` arguments as C-strings if the comparison happened as pointers. Printing as pointers is more relevant to the result of the comparison.
  > Rollback StringAppendAndOverwrite() - the problem is that StringResizeAndOverwrite has MSAN testing of the entire string. This causes quadratic MSAN verification on small appends.
  > Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
  > PR #1961: Fix Clang warnings on powerpc
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > macOS CI: Move the Bazel vendor_dir to ${HOME} to workaround a Bazel issue where it does not work when it is in ${TMP} and also fix the quoting which was causing it to incorrectly receive the argument
  > Use __msan_check_mem_is_initialized for detailed MSan report
  > Optimize stack unwinding by reducing `AddressIsReadable` calls.
  > Add internal API to allow bypassing stack trace fixups when needed
  > absl::StrFormat: improve test coverage with scientific exponent test cases
  > Add throughput and latency benchmarks for `absl::ToDoubleXYZ` functions.
  > CordzInfo: Use absl::NoDestructor to remove a global destructor. Chromium requires no global destructors.
  > string_view: Enable std::view and std::borrowed_range
  > cleanup: s/logging_internal/log_internal/ig for consistency
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in string escaping utilities
  > Use the new absl::StringResizeAndOverwrite() in absl::AsciiStrTo{Lower|Upper}
  > Use the new absl::StringResizeAndOverwrite() in absl::StrJoin()
  > Use the new absl::StringResizeAndOverwrite() in absl::StrCat()
  > string_view: Fix include order
  > Don't pass nullptr as the 1st arg of `from_chars`
  > absl/types: format code with clang-format.
  > Validate absl::StringResizeAndOverwrite op has written bytes as expected.
  > Skip the ShortStringCollision test on WASM.
  > Rollback `absl/types`: format code with clang-format.
  > Remove usage of the WasmOffsetConverter for Wasm / Emscripten stack-traces.
  > Use the new absl::StringResizeAndOverwrite() in absl::CordCopyToString()
  > Remove an undocumented behavior of --vmodule and absl::SetVLogLevel that could set a module_pattern to defer to the global vlog threshold.
  > Update to rules_cc 0.2.9
  > Avoid redefine warnings with ntstatus constants
  > PR #1944: Use same element-width for non-temporal loads and stores on Arm
  > absl::StringResizeAndOverwrite(): Add the requirement that the only value that can be written to buf[size] is the terminator character.
  > absl/types: format code with clang-format.
  > Minor formatting changes.
  > Remove `IntIdentity` and `PtrIdentity` from `raw_hash_set_probe_benchmark`.
  > Automated rollback of commit cad60580dba861d36ed813564026d9774d9e4e2b.
  > FlagStateInterface implementors need only support being restored once.
  > Clarify the post-condition of `reserve()` in Abseil hash containers.
  > Clarify the post-condition of `reserve()` in Abseil hash containers.
  > Represent dropped samples in hashtable profile.
  > Add lifetimebound to absl::implicit_cast and make it work for rvalue references as it already does with lvalue references
  > Clean up a doc example where we had `absl_nonnull` and `= nullptr;`
  > Change Cordz to synchronize tracked cords with Snapshots / DeleteQueue
  > Minor refactor to `num_threads` in deadlock test
  > Rename VLOG macro parameter to match other uses of this pseudo type.
  > `time`: Fix indentation
  > Automated Code Change
  > Adds `absl::StringResizeAndOverwrite` as a polyfill for C++23's `std::basic_string<CharT,Traits,Allocator>::resize_and_overwrite`
  > Internal-only change
  > absl/time: format code with clang-format.
  > No public description
  > Expose typed releasers of externally appended memory.
  > Fix __declspec support for ABSL_DECLARE_FLAG()
  > Annotate absl::AnyInvocable as an owner type via [[gsl::Owner]] and absl_internal_is_view = std::false_type
  > Annotate absl::FunctionRef as a view type via [[gsl::Pointer]] and absl_internal_is_view
  > Remove unnecessary dep on `core_headers` from the `nullability` cc_library
  > type_traits: Add type_identity and type_traits_t backfills
  > Refactor raw_hash_set range insertion to call private insert_range function.
  > Fix bug in absl::FunctionRef conversions from non-const to const
  > PR #1937: Simplify ConvertSpecialToEmptyAndFullToDeleted
  > Improve absl::FunctionRef compatibility with C++26
  > Add a workaround for unused variable warnings inside of not-taken if-constexpr codepaths in older versions of GCC
  > Annotate ABSL_DIE_IF_NULL's return type with `absl_nonnull`
  > Move insert index computation into `PrepareInsertLarge` in order to reduce inlined part of insert/emplace operations.
  > Automated Code Change
  > PR #1939: Add missing rules_cc loads
  > Expose (internally) a LogMessage constructor taking file as a string_view for (internal, upcoming) FFI integration.
  > Fixed up some #includes in mutex.h
  > Make absl::FunctionRef support non-const callables, aligning it with std::function_ref from C++26
  > Move capacity update in `Grow1To3AndPrepareInsert` after accessing `common.infoz()` to prevent assertion failure in `control()`.
  > Fix check_op(s) compilation failures on gcc 8 which eagerly tries to instantiate std::underlying_type for non-num types.
  > Use `ABSL_ATTRIBUTE_ALWAYS_INLINE`for lambda in `find_or_prepare_insert_large`.
  > Mark the implicit floating operators as constexpr for `absl::int128` and `absl::uint128`
  > PR #1931: raw_hash_set: fix instantiation for recursive types on MSVC with /Zc:__cplusplus
  > Add std::pair specializations for IsOwner and IsView
  > Cast ABSL_MIN_LOG_LEVEL to absl::LogSeverityAtLeast instead of absl::LogSeverity.
  > Fix a corner case in the aarch64 unwinder
  > Fix inconsistent nullability annotation in ReleasableMutexLock
  > Remove support for Native Client
  > Rollback f040e96b93dba46e8ed3ca59c0444cbd6c0a0955
  > When printing CHECK_XX failures and both types are unprintable, don't bother printing " (UNPRINTABLE vs. UNPRINTABLE)".
  > PR #1929: Fix shorten-64-to-32 warning in stacktrace_riscv-inl.inc
  > Refactor `find_or_prepare_insert_large` to use a single return statement using a lambda.
  > Use possible CPUs to identify NumCPUs() on Linux.
  > Fix incorrect nullability annotation of `absl::Cord::InlineRep::set_data()`.
  > Move SetCtrl* family of functions to cc file.
  > Change absl::InlinedVector::clear() so that it does not deallocate any allocated space. This allows allocations to be reused and matches the behavior specification of std::vector::clear().
  > Mark Abseil container algorithms as `constexpr` for C++20.
  > Fix `CHECK_<OP>` ambiguous overload for `operator<<` in older versions of GCC when C-style strings are compared
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > Optimize `CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend` by interleaving the `CRC32_u64` calls at a lower level.
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > stacktrace_test: avoid spoiling errno in the test signal handler.
  > std::multimap::find() is not guaranteed to return the first entry with the requested key. Any may be returned if many exist.
  > Mark `/`, `%`, and `*` operators as constexpr when intrinsics are available.
  > Add the C++20 string_view contructor that uses iterators
  > Implement absl::erase_if for absl::InlinedVector
  > Adjust software prefetch to fetch 5 cachelines ahead, as benchmarking suggests this should perform better.
  > Reduce maximum load factor to 27/32 (from 28/32).
  > Remove unused include
  > Remove unused include statement
  > PR #1921: Fix ABSL_BUILD_DLL mode (absl_make_dll) with mingw
  > PR #1922: Enable mmap for WASI if it supports the mman header
  > Rollback C++20 string_view constructor that uses iterators due to broken builds
  > Add the C++20 string_view contructor that uses iterators
  > Bump versions of dependencies in MODULE.bazel
  > Automated Code Change
  > PR #1918: base: add musl + ppc64le fallback for UnscaledCycleClock::Frequency
  > Optimize crc32 Extend by removing obsolete length alignment.
  > Fix typo in comment of `ABSL_ATTRIBUTE_UNUSED`.
  > Mark AnyInvocable as being nullability compatible.
  > Ensure stack usage remains low when unwinding the stack, to prevent stack overflows
  > Shrink #if ABSL_HAVE_ATTRIBUTE_WEAK region sizes in stacktrace_test.cc
  > <filesystem> is not supported for XTENSA. Disable it in //absl/hash/internal/hash.h.
  > Use signal-safe dynamic memory allocation for stack traces when necessary
  > PR #1915: Fix SYCL Build Compatibility with Intel LLVM Compiler on Windows for abseil
  > Import of CCTZ from GitHub.
  > Tag tests that currently fail on ios_sim_arm64 with "no_test_ios_sim_arm64"
  > Automated Code Change
  > Automated Code Change
  > Import of CCTZ from GitHub.
  > Move comment specific to pointer-taking MutexLock variant to its definition.
  > Add lifetime annotations to MutexLock, SpinLockHolder, etc.
  > Add lifetimebound annotations to absl::MakeSpan and absl::MakeConstSpan to detect dangling references
  > Remove comment mentioning deferenceability.
  > Add referenceful MutexLock with Condition overload.
  > Mark SpinLock camel-cased methods as ready for inlining.
  > Whitespace change
  > In logging tests that write expectations against `ScopedMockLog::Send`, suppress the default behavior that forwards to `ScopedMockLog::Log` so that unexpected logs are printed with full metadata.  Many of these tests are poking at those metadata, and a failure message that doesn't include them is unhelpful.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::ClippedSubstr
  > Inline internal usages of Mutex::Lock, etc. in favor of lock.
  > Inline internal usages of pointerful SpinLockHolder/MutexLock.
  > Remove wrong comment in Cord::Unref
  > Update the crc32 dynamic dispatch table with newer platforms.
  > PR #1914: absl/base/internal/poison.cc: Minor build fix
  > Accept references on SpinLockHolder/MutexLock
  > Import of CCTZ from GitHub.
  > Fix typos in comments.
  > Inline SpinLock Lock->lock, Unlock->unlock internal to Abseil.
  > Rename Mutex methods to use the typical C++ lower case names.
  > Rename SpinLock methods to use the typical C++ lower case names.
  > Add an assert that absl::StrSplit is not called with a null char* argument.
  > Fix sign conversion warning
  > PR #1911: Fix absl_demangle_test on ppc64
  > Disallow using a hash function whose return type is smaller than size_t.
  > Optimize CRC-32C extension by zeroes
  > Deduplicate stack trace implementations in stacktrace.cc
  > Align types of location_table_ and mapping_table_ keys (-Wshorten-64-to-32).
  > Move SigSafeArena() out to absl/base/internal/low_level_alloc.h
  > Allow CHECK_<OP> variants to be used with unprintable types.
  > Import of CCTZ from GitHub.
  > Adds required load statements for C++ rules to BUILD and bzl files.
  > Disable sanitizer bounds checking in ComputeZeroConstant.
  > Roll back NDK weak symbol mode for backtrace() due to internal test breakage
  > Add converter for extracting SwissMap profile information into a https://github.com/google/pprof suitable format for inspection.
  > Allocate memory for frames and sizes during stack trace fix-up when no memory is provided
  > Support NDK weak symbol mode for backtrace() on Android.
  > Change skip_empty_or_deleted to not use groups.
  > Fix bug of dereferencing invalidated iterator in test case.
  > Refactor: split erase_meta_only into large and small versions.
  > Fix a TODO to use std::is_nothrow_swappable when it became available.
  > Clean up the testing of alternate options that were removed in previous changes
  > Only use generic stacktrace when ABSL_HAVE_THREAD_LOCAL.
  > Automated Code Change
  > Add triviality tests for absl::Span
  > Loosen the PointerAlignment test to allow up to 5 stuck bits to avoid flakiness.
  > Prevent conversion constructions from absl::Span to itself
  > Skip flaky expectations in waiter_test for MSVC.
  > Refactor: call AssertIsFull from iterator::assert_is_full to avoid passing the same arguments repeatedly.
  > In AssertSameContainer, remove the logic checking for whether the iterators are from SOO tables or not since we don't use it to generate a more informative debug message.
  > Remove unused NonIterableBitMask::HighestBitSet function.
  > Refactor: move iterator unchecked_* members before data members to comply with Google C++ style guide.
  > Mix pointers once instead of twice now that we've improved mixing on 32-bit platforms and improved the kMul constant.
  > Remove unused utility functions/constants.
  > Revert a change for breaking downstream third party libs
  > Remove unneeded include from cord_rep_btree_navigator.h
  > Refactor: move find_first_non_full into raw_hash_set.cc.
  > Perform stronger mixing on 32-bit platforms and enable the LowEntropyStrings test.
  > Include deallocated caller-provided size in delete hooks.
  > Roll back one more time: In debug mode, assert that the probe sequence isn't excessively long.
  > Allow a `std::move` of `delimiter_` to happen in `ByString::ByString(ByString&&)`. Right now the move ctor is making a copy because the source object is `const`.
  > Assume that control bytes don't alias CommonFields.
  > Consistently use [[maybe_unused]] in raw_hash_set.h for better compiler warning compatibility.
  > Roll forward: In debug mode, assert that the probe sequence isn't excessively long.
  > Add a new test for hash collisions for short strings when PrecombineLengthMix has low quality.
  > Refactor: define CombineRawImpl for repeated `Mix(state ^ value, kMul)` operations.
  > Automated Code Change
  > Mark hash_test as large so that the timeout is increased.
  > Change the value of kMul to have higher entropy and prevent collisions when keys are aligned integers or pointers.
  > Fix LIFETIME annotations for op*/op->/value operators for reference types.
  > Update StatusOr to support lvalue reference value types.
  > Rollback debug assertion that the probe sequence isn't excessively long.
  > AnyInvocable: Fix operator==/!= comments
  > In debug mode, assert that the probe sequence isn't excessively long.
  > Improve NaN handling in absl::Duration arithmetic.
  > Change PrecombineLengthMix to sample data from kStaticRandomData.
  > Fix includes and fuse constructors of SpinLock.
  > Enable `operator==` for `StatusOr` only if the contained type is equality-comparable
  > Enable SIMD memcpy-crc on ARM cores.
  > Improve mixing on 32-bit platforms.
  > Change DurationFromDouble to return -InfiniteDuration() for all NaNs.
  > Change return type of hash internal `Seed` to `size_t` from `uint64_t`
  > CMake: Add a fatal error when the compiler defaults to or is set to a C++ language standard prior to C++17.
  > Make bool true hash be ~size_t{} instead of 1 so that all bits are different between true/false instead of only one.
  > Automated Code Change
  > Pass swisstable seed as seed to absl::Hash so we can save an XOR in H1.
  > Add support for scoped enumerations in CHECK_XX().
  > Revert no-inline on Voidify::operator&&() -- caused unexpected binary size growth
  > Mark Voidify::operator&&() as no-inline. This improves stack trace for `LOG(FATAL)` with optimization on.
  > Refactor long strings hash computations and move `len <= PiecewiseChunkSize()` out of the line to keep only one function call in the inlined hash code.
  > rotr/rotl: Fix undefined behavior when passing INT_MIN as the number of positions to rotate by
  > Reorder members of MixingHashState to comply with Google C++ style guide ordering of type declarations, static constants, ctors, non-ctor functions.
  > Delete unused function ShouldSampleHashtablezInfoOnResize.
  > Remove redundant comments that just name the following symbol without providing additional information.
  > Remove unnecessary modification of growth info in small table case.
  > Suppress CFI violation on VDSO call.
  > Replace WeakMix usage with Mix and change H2 to use the most significant 7 bits - saving 1 cycle in H1.
  > Fix -Wundef warning
  > Fix conditional constexpr in ToInt64{Nano|Micro|Milli}seconds under GCC7 and GCC8 using an else clause as a workaround
  > Enable CompressedTupleTest.NestedEbo test case.
  > Lift restriction on using EBCO[1] for nested CompressedTuples. The current implementation of CompressedTuple explicitly disallows EBCO for cases where CompressedTuples are nested. This is because the implentation for a tuple with EBCO-compatible element T inherits from Storage<T, I>, where I is the index of T in the tuple, and
  > absl::string_view: assert against (data() == nullptr && size() != 0)
  > Fix a false nullability warning in [Q]CHECK_OK by replacing nullptr with an empty char*
  > Make `combine_contiguous` to mix length in a weak way by adding `size << 24`, so that we can avoid a separate mixing of size later. The empty range is mixing 0x57 byte.
  > Add a test case that -1.0 and 1.0 have different hashes.
  > Update CI to a more recent Clang on Linux x86-64
  > `absl::string_view`: Add a debug assert to the single-argument constructor that the argument is not `nullptr`.
  > Fix CI on macOS Sequoia
  > Use Xcode 16.3 for testing
  > Use a proper fix instead of a workaround for a parameter annotated absl_nonnull since the latest Clang can see through the workaround
  > Assert that SetCtrl isn't called on small tables - there are no control bytes in such cases.
  > Use `MaskFullOrSentinel` in `skip_empty_or_deleted`.
  > Reduce flakiness in MockDistributions.Examples test case.
  > Rename PrepareInsertNonSoo to PrepareInsertLarge now that it's no longer used in all non-SOO cases.
  > PR #1895: use c++17 in podspec
  > Avoid hashing the key in prefetch() for small tables.
  > Remove template alias nullability annotations.
  > Add `Group::MaskFullOrSentinel` implementation without usage.
  > Move `hashtable_control_bytes` tests into their own file.
  > Simplify calls to `EqualElement` by introducing `equal_to` helper function.
  > Do `common.increment_size()` directly in SmallNonSooPrepareInsert if inserting to reserved 1 element table.
  > Import of CCTZ from GitHub.
  > Small cleanup of `infoz` processing to get the logic out of the line or removed.
  > Extract the entire PrepareInsert to Small non SOO table out of the line.
  > Take `get_hash` implementation out of the SwissTable class to minimize number of instantiations.
  > Change kEmptyGroup to kDefaultIterControl now that it's only used for default-constructed iterators.
  > [bits] Add tests for return types
  > Avoid allocating control bytes in capacity==1 swisstables.
  > PR #1888: Adjust Table.GrowExtremelyLargeTable to avoid OOM on i386
  > Avoid mixing after `Hash64` calls for long strings by passing `state` instead of `Seed` to low level hash.
  > Indent absl container examples consistently
  > Revert- Doesn't actually work because SWIG doesn't use the full preprocessor
  > Add tags to skip some tests under UBSAN.
  > Avoid subtracting `it.control()` and `table.control()` in single element table during erase.
  > Remove the `salt` parameter from low level hash and use a global constant. That may potentially remove some loads.
  > In SwissTable, don't hash the key when capacity<=1 on insertions.
  > Remove the "small" size designation for thread_identity_test, which causes the test to timeout after 60s.
  > Add comment explaining math behind expressions.
  > Exclude SWIG from ABSL_DEPRECATED and ABSL_DEPRECATE_AND_INLINE
  > stacktrace_x86: Handle nested signals on altstack
  > Import of CCTZ from GitHub.
  > Simplify MixingHashState::Read9To16 to not depend on endianness.
  > Delete deprecated `absl::Cord::Get` and its remaining call sites.
  > PR #1884: Remove duplicate dependency
  > Remove relocatability test that is no longer useful
  > Import of CCTZ from GitHub.
  > Fix a bug of casting sizeof(slot_type) to uint16_t instead of uint32_t.
  > Rewrite `WideToUtf8` for improved readability.
  > Avoid requiring default-constructability of iterator type in algorithms that use ContainerIterPairType
  > Added test cases for invalid surrogates sequences.
  > Use __builtin_is_cpp_trivially_relocatable to implement absl::is_trivially_relocatable in a way that is compatible with PR2786 in the upcoming C++26.
  > Remove dependency on `wcsnlen` for string length calculation.
  > Stop being strict about validating the "clone" part of mangled names
  > Add support for logging wide strings in `absl::log`.
  > Deprecate `ABSL_HAVE_STD_STRING_VIEW`.
  > Change some nullability annotations in absl::Span to absl_nullability_unknown to workaround a bug that makes nullability checks trigger in foreach loops, while still fixing the -Wnullability-completeness warnings.
  > Linux CI update
  > Fix new -Wnullability-completeness warnings found after upgrading the Clang version used in the Linux ARM CI to Clang 19.
  > Add __restrict for uses of PolicyFunctions.
  > Use Bazel vendor mode to cache external dependencies on Windows and macOS
  > Move PrepareInsertCommon from header file to cc file.
  > Remove the explicit from the constructor to a test allocator in hash_policy_testing.h. This is rejected by Clang when using the libstdc++ that ships with GCC15
  > Extract `WideToUtf8` helper to `utf8.h`.
  > Updates the documentation for `CHECK` to make it more explicit that it is used to require that a condition is true.
  > Add PolicyFunctions::soo_capacity() so that the compiler knows that soo_capacity() is always 0 or 1.
  > Expect different representations of pointers from the Windows toolchain.
  > Add set_no_seed_for_testing for use in GrowExtremelyLargeTable test.
  > Update GoogleTest dependency to 1.17.0 to support GCC15
  > Assume that frame pointers inside known stack bounds are readable.
  > Remove fallback code in absl/algorithm/container.h
  > Fix GCC15 warning that <ciso646> is deprecated in C++17
  > Fix misplaced closing brace
  > Remove unused include.
  > Automated Code Change
  > Type erase copy constructor.
  > Refactor to use hash_of(key) instead of hash_ref()(key).
  > Create Table.Prefetch test to make sure that it works.
  > Remove NOINLINE on the constructor with buckets.
  > In SwissTable, don't hash the key in find when capacity<=1.
  > Use 0x57 instead of Seed() for weakly mixing of size.
  > Use absl::InsecureBitGen in place of std::random_device in Abseil tests.
  > Remove unused include.
  > Use large 64 bits kMul for 32 bits platforms as well.
  > Import of CCTZ from GitHub.
  > Define `combine_weakly_mixed_integer` in HashSelect::State in order to allow `friend auto AbslHashValue` instead of `friend H AbslHashValue`.
  > PR #1878: Fix typos in comments
  > Update Abseil dependencies in preparation for release
  > Use weaker mixing for absl::Hash for types that mix their sizes.
  > Update comments on UnscaledCycleClock::Now.
  > Use alignas instead of the manual alignment for the Randen entropy pool.
  > Document nullability annotation syntax for array declarations (not many people may know the syntax).
  > Import of CCTZ from GitHub.
  > Release tests for ABSL_RAW_DCHECK and ABSL_RAW_DLOG.
  > Adjust threshold for stuck bits to avoid flaky failures.
  > Deprecate template type alias nullability annotations.
  > Add more probe benchmarks
  > PR #1874: Simplify detection of the powerpc64 ELFv1 ABI
  > Make `absl::FunctionRef` copy-assignable. This brings it more in line with `std::function_ref`.
  > Remove unused #includes from absl/base/internal/nullability_impl.h
  > PR #1870: Retry SymInitialize on STATUS_INFO_LENGTH_MISMATCH
  > Prefetch from slots in parallel with reading from control.
  > Migrate template alias nullability annotations to macros.
  > Improve dependency graph in `TryFindNewIndexWithoutProbing` hot path evaluation.
  > Add latency benchmarks for Hash for strings with size 3, 5 and 17.
  > Exclude UnwindImpl etc. from thread sanitizer due to false positives.
  > Use `GroupFullEmptyOrDeleted` inside of `transfer_unprobed_elements_to_next_capacity_fn`.
  > PR #1863: [minor] Avoid variable shadowing for absl btree
  > Extend stack-frame walking functionality to allow dynamic fixup
  > Fix "unsafe narrowing" in absl for Emscripten
  > Roll back change to address breakage
  > Extend stack-frame walking functionality to allow dynamic fixup
  > Introduce `absl::Cord::Distance()`
  > Avoid aliasing issues in growth information initialization.
  > Make `GrowSooTableToNextCapacityAndPrepareInsert` in order to initialize control bytes all at once and avoid two function calls on growth right after SOO.
  > Simplify `SingleGroupTableH1` since we do not need to mix all bits anymore. Per table seed has a good last bit distribution.
  > Use `NextSeed` instead of `NextSeedBaseNumber` and make the result type to be `uint16_t`. That avoids unnecessary bit twiddling and simplify the code.
  > Optimize `GrowthToLowerBoundCapacity` in order to avoid division.
  > [base] Make :endian internal to absl
  > Fully qualify absl names in check macros to avoid invalid name resolution when the user scope has those names defined.
  > Fix memory sanitization in `GrowToNextCapacityAndPrepareInsert`.
  > Define and use `ABSL_SWISSTABLE_ASSERT` in cc file since a lot of logic moved there.
  > Remove `ShouldInsertBackwards` functionality. It was used for additional order randomness in debug mode. It is not necessary anymore with introduction of separate per table `seed`.
  > Fast growing to the next capacity based on carbon hash table ideas.
  > Automated Code Change
  > Refactor CombinePiecewiseBuffer test case to (a) call PiecewiseChunkSize() to get the chunk size and (b) use ASSERT for expectation in a loop.
  > PR #1867: Remove global static in stacktrace_win32-inl.inc
  > Mark Abseil hardening assert in AssertIsValidForComparison as slow.
  > Roll back a problematic change.
  > Add absl::FastTypeId<T>()
  > Automated Code Change
  > Update TestIntrinsicInt128 test to print the indices with the conflicting hashes.
  > Code simplification: we don't need XOR and kMul when mixing large string hashes into hash state.
  > Refactor absl::CUnescape() to use direct string output instead of pointer/size.
  > Rename `policy.transfer` to `policy.transfer_n`.
  > Optimize `ResetCtrl` for small tables with `capacity < Group::KWidth * 2` (<32 if SSE enabled and <16 if not).
  > Use 16 bits of per-table-seed so that we can save an `and` instruction in H1.
  > Fully annotate nullability in headers where it is partially annotated.
  > Add note about sparse containers to (flat|node)_hash_(set|map).
  > Make low_level_alloc compatible with -Wthread-safety-pointer
  > Add missing direct includes to enable the removal of unused includes from absl/base/internal/nullability_impl.h.
  > Add tests for macro nullability annotations analogous to existing tests for type alias annotations.
  > Adds functionality to return stack frame pointers during stack walking, in addition to code addresses
  > Use even faster reduction algorithm in FinalizePclmulStream()
  > Add nullability annotations to some very-commonly-used APIs.
  > PR #1860: Add `unsigned` to character buffers to ensure they can provide storage (https://eel.is/c++draft/intro.object#3)
  > Release benchmarks for absl::Status and absl::StatusOr
  > Use more efficient reduction algorithm in FinalizePclmulStream()
  > Add a test case to make it clear that `--vmodule=foo/*=1` does match any children and grandchildren and so on under `foo/`.
  > Gate use of clang nullability qualifiers through absl nullability macros on `nullability_on_classes`.
  > Mark `absl::StatusOr::status()` as ABSL_MUST_USE_RESULT
  > Cleanups related to benchmarks   * Fix many benchmarks to be cc_binary instead of cc_test   * Add a few benchmarks for StrFormat   * Add benchmarks for Substitute   * Add benchmarks for Damerau-Levenshtein distance used in flags
  > Add a log severity alias `DO_NOT_$UBMIT` intended for logging during development
  > Avoid relying on true and false tokens in the preprocessor macros used in any_invocable.h
  > Avoid relying on true and false tokens in the preprocessor macros used in absl/container
  > Refactor to make it clear that H2 computation is not repeated in each iteration of the probe loop.
  > Turn on C++23 testing for GCC and Clang on Linux
  > Fix overflow of kSeedMask on 32 bits platform in `generate_new_seed`.
  > Add a workaround for std::pair not being trivially copyable in C++23 in some standard library versions
  > Refactor WeakMix to include the XOR of the state with the input value.
  > Migrate ClearPacBits() to a more generic implementation and location
  > Annotate more Abseil container methods with [[clang::lifetime_capture_by(...)]] and make them all forward to the non-captured overload
  > Make PolicyFunctions always be the second argument (after CommonFields) for type-erased functions.
  > Move GrowFullSooTableToNextCapacity implementation with some dependencies to cc file.
  > Optimize btree_iterator increment/decrement to avoid aliasing issues by using local variables instead of repeatedly writing to `this`.
  > Add constexpr conversions from absl::Duration to int64_t
  > PR #1853: Add support for QCC compiler
  > Fix documentation for key requirements of flat_hash_set
  > Use `extern template` for `GrowFullSooTableToNextCapacity` since we know the most common  set of paramenters.
  > C++23: Fix log_format_test to match the stream format for volatile pointers
  > C++23: Fix compressed_tuple_test.
  > Implement `btree::iterator::+=` and `-=`.
  > Stop calling `ABSL_ANNOTATE_MEMORY_IS_INITIALIZED` for threadlocal counter.
  > Automated Code Change
  > Introduce seed stored in the hash table inside of the size.
  > Replace ABSL_ATTRIBUTE_UNUSED with [[maybe_unused]]
  > Minor consistency cleanups to absl::BitGen mocking.
  > Restore the empty CMake targets for bad_any_cast, bad_optional_access, and bad_variant_access to allow clients to migrate.
  > bits.h: Add absl::endian and absl::byteswap polyfills
  > Use absl::NoDestructor an absl::Mutex instance in the flags library to prevent some exit-time destructor warnings
  > Add thread GetEntropyFromRandenPool test
  > Update nullability annotation documentation to focus on macro annotations.
  > Simplify some random/internal types; expose one function to acquire entropy.
  > Remove pre-C++17 workarounds for lack of std::launder
  > UBSAN: Use -fno-sanitize-recover
  > int128_test: Avoid testing signed integer overflow
  > Remove leading commas in `Describe*` methods of `StatusIs` matcher.
  > absl::StrFormat: Avoid passing null to memcpy
  > str_cat_test: Avoid using invalid enum values
  > hash_generator_testing: Avoid using invalid enum values
  > absl::Cord: Avoid passing null to memcpy and memset
  > graphcycles_test: Avoid applying a non-zero offset to a null pointer
  > Make warning about wrapping empty std::function in AnyInvocable stronger.
  > absl/random: Convert absl::BitGen / absl::InsecureBitGen to classes from aliases.
  > Fix buffer overflow the internal demangling function
  > Avoid calling `ShouldRehashForBugDetection` on the first two inserts to the table.
  > Remove the polyfill implementations for many type traits and alias them to their std equivalents. It is recommended that clients now simple use the std equivalents.
  > ROLLBACK: Limit slot_size to 2^16-1 and maximum table size to 2^43-1.
  > Limit `slot_size` to `2^16-1` and maximum table size to `2^43-1`.
  > Use C++17 [[nodiscard]] instead of the deprecated ABSL_MUST_USE_RESULT
  > Remove the polyfills for absl::apply and absl::make_from_tuple, which were only needed prior to C++17. It is recommended that clients simply use std::apply and std::make_from_tuple.
  > PR #1846: Fix build on big endian
  > Bazel: Move environment variables to --action_env
  > Remove the implementation of `absl::variant`, which was only needed prior to C++17. `absl::variant` is now an alias for `std::variant`. It is recommended that clients simply use `std::variant`.
  > MSVC: Fix warnings c4244 and c4267 in the main library code
  > Update LowLevelHashLenGt16 to be LowLevelHashLenGt32 now that the input is guaranteed to be >32 in length.
  > Xtensa does not support thread_local. Disable it in absl/base/config.h.
  > Add support for 8-bit and 16-bit integers to absl::SimpleAtoi
  > CI: Update Linux ARM latest container
  > Add time hash tests
  > `any_invocable`: Update comment that refer to C++17 and C++11
  > `check_test_impl.inc`: Use C++17 features unconditionally
  > Remove the implementation of `absl::optional`, which was only needed prior to C++17. `absl::optional` is now an alias for `std::optional`. It is recommended that clients simply use `std::optional`.
  > Move hashtable control bytes manipulation to a separate file.
  > Fix a use-after-free bug in which the string passed to `AtLocation` may be referenced after it is destroyed. While the string does live until the end of the full statement, logging (previously occurred) in the destructor of the `LogMessage` which may be constructed before the temporary string (and thus destroyed after the temporary string's destructor).
  > `internal/layout`: Delete pre-C++17 out of line definition of constexpr class member
  > Extract slow path for PrepareInsertNonSoo to a separate function `PrepareInsertNonSooSlow`.
  > Minor code cleanups
  > `internal/log_message`: Use `if constexpr` instead of SFINAE for `operator<<`
  > [absl] Use `std::min` in `constexpr` contexts in `absl::string_view`
  > Remove the implementation of `absl::any`, which was only needed prior to C++17. `absl::any` is now an alias for `std::any`. It is recommended that clients simply use `std::any`.
  > Remove ABSL_INTERNAL_NEED_REDUNDANT_CONSTEXPR_DECL which is longer needed with the C++17 floor
  > Make `OptimalMemcpySizeForSooSlotTransfer` ready to work with MaxSooSlotSize upto `3*sizeof(size_t)`.
  > `internal/layout`: Replace SFINAE with `if constexpr`
  > PR #1830: C++17 improvement: use if constexpr in internal/hash.h
  > `absl`: Deprecate `ABSL_HAVE_CLASS_TEMPLATE_ARGUMENT_DEDUCTION`
  > Add a verification for access of being destroyed table. Also enabled access after destroy check in ASAN optimized mode.
  > Store `CharAlloc` in SwissTable in order to simplify type erasure of functions accepting allocator as `void*`.
  > Introduce and use `SetCtrlInLargeTable`, when we know that table is at least one group. Similarly to `SetCtrlInSingleGroupTable`, we can save some operations.
  > Make raw_hash_set::slot_type private.
  > Delete absl/utility/internal/if_constexpr.h
  > `internal/any_invocable`: Use `if constexpr` instead of SFINAE when initializing storage accessor
  > Depend on string_view directly
  > Optimize and slightly simplify `PrepareInsertNonSoo`.
  > PR #1833: Make ABSL_INTERNAL_STEP_n macros consistent in crc code
  > `internal/any_invocable`: Use alias `RawT` consistently in `InitializeStorage`
  > Move the implementation of absl::ComputeCrc32c to the header file, to facilitate inlining.
  > Delete absl/base/internal/inline_variable.h
  > Add lifetimebound to absl::StripAsciiWhitespace
  > Revert: Random: Use target attribute instead of -march
  > Add return for opt mode in AssertNotDebugCapacity to make sure that code is not evaluated in opt mode.
  > `internal/any_invocable`: Delete TODO, improve comment and simplify pragma in constructor
  > Split resizing routines and type erase similar instructions.
  > Random: Use target attribute instead of -march
  > `internal/any_invocable`: Use `std::launder` unconditionally
  > `internal/any_invocable`: Remove suppresion of false positive -Wmaybe-uninitialized on GCC 12
  > Fix feature test for ABSL_HAVE_STD_OPTIONAL
  > Support C++20 iterators in raw_hash_map's random-access iterator detection
  > Fix mis-located test dependency
  > Disable the DestroyedCallsFail test on GCC due to flakiness.
  > `internal/any_invocable`: Implement invocation using `if constexpr` instead of SFINAE
  > PR #1835: Bump deployment_target version and add visionos to podspec
  > PR #1828: Fix spelling of pseudorandom in README.md
  > Make raw_hash_map::key_arg private.
  > `overload`: Delete obsolete macros for undefining `absl::Overload` when C++ < 17
  > `absl/base`: Delete `internal/invoke.h` and `invoke_test.cc`
  > Remove `WORKSPACE.bazel`
  > `absl`: Replace `base_internal::{invoke,invoke_result_t,is_invocable_r}` with `std` equivalents
  > Allow C++20 forward iterators to use fast paths
  > Factor out some iterator traits detection code
  > Type erase IterateOverFullSlots to decrease code size.
  > `any_invocable`: Delete pre-C++17 workarounds for `noexcept` and guaranteed copy elision
  > Make raw_hash_set::key_arg private.
  > Rename nullability macros to use new lowercase spelling.
  > Fix bug where ABSL_REQUIRE_EXPLICIT_INIT did not actually result in a linker error
  > Make Randen benchmark program use runtime CPU detection.
  > Add CI for the C++20/Clang/libstdc++ combination
  > Move Abseil to GoogleTest 1.16.0
  > `internal/any_invocable`: Use `if constexpr` instead of SFINAE in `InitializeStorage`
  > More type-erasing of InitializeSlots by removing the Alloc and AlignOfSlot template parameters.
  > Actually use the hint space instruction to strip PAC bits for return addresses in stack traces as the comment says
  > `log/internal`: Replace `..._ATTRIBUTE_UNUSED_IF_STRIP_LOG` with C++17 `[[maybe_unused]]`
  > `attributes`: Document `ABSL_ATTRIBUTE_UNUSED` as deprecated
  > `internal/any_invocable`: Initialize using `if constexpr` instead of ternary operator, enum, and templates
  > Fix flaky tests due to sampling by introducing utility to refresh sampling counters for the current thread.
  > Minor reformatting in raw_hash_set: - Add a clear_backing_array member to declutter calls to ClearBackingArray. - Remove some unnecessary `inline` keywords on functions. - Make PoisonSingleGroupEmptySlots static.
  > Update CI for linux_gcc-floor to use GCC9, Bazel 7.5, and CMake 3.31.5.
  > `internal/any_invocable`: Rewrite `IsStoredLocally` type trait into a simpler constexpr function
  > Add ABSL_REQUIRE_EXPLICIT_INIT to Abseil to enable enforcing explicit field initializations
  > Require C++17
  > Minimize number of `InitializeSlots` with respect to SizeOfSlot.
  > Leave the call to `SampleSlow` only in type erased InitializeSlots.
  > Update comments for Read4To8 and Read1To3.
  > PR #1819: fix compilation with AppleClang
  > Move SOO processing inside of InitializeSlots and move it once.
  > PR #1816: Random: use getauxval() via <sys/auxv.h>
  > Optimize `InitControlBytesAfterSoo` to have less writes and make them with compile time known size.
  > Remove stray plus operator in cleanup_internal::Storage
  > Include <cerrno> to fix compilation error in chromium build.
  > Adjust internal logging namespacing for consistency s/ABSL_LOGGING_INTERNAL_/ABSL_LOG_INTERNAL_/
  > Rewrite LOG_EVERY_N (et al) docs to clarify that the first instance is logged.  Also, deliberately avoid giving exact numbers or examples since IRL behavior is not so exact.
  > ABSL_ASSUME: Use a ternary operator instead of do-while in the implementations that use a branch marked unreachable so that it is usable in more contexts.
  > Simplify the comment for raw_hash_set::erase.
  > Remove preprocessors for now unsupported compilers.
  > `absl::ScopedMockLog`: Explicitly document that it captures logs emitted by all threads
  > Fix potential integer overflow in hash container create/resize
  > Add lifetimebound to StripPrefix/StripSuffix.
  > Random: Rollforward support runtime dispatch on AArch64 macOS
  > Crc: Only test non_temporal_store_memcpy_avx on AVX targets
  > Provide information about types of all flags.
  > Deprecate the precomputed hash find() API in swisstable.
  > Import of CCTZ from GitHub.
  > Adjust whitespace
  > Expand documentation for absl::raw_hash_set::erase to include idiom example of iterator post-increment.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Crc: Remove the __builtin_cpu_supports path for SupportsArmCRC32PMULL
  > Use absl::NoDestructor for some absl::Mutex instances in the flags library to prevent some exit-time destructor warnings
  > Update the WORKSPACE dependency of rules_cc to 0.1.0
  > Rollback support runtime dispatch on AArch64 macOS for breaking some builds
  > Downgrade to rules_cc 0.0.17 because 0.1.0 was yanked
  > Use unused set in testing.
  > Random: Support runtime dispatch on AArch64 macOS
  > crc: Use absl::nullopt when returning absl::optional
  > Annotate absl::FixedArray to warn when unused.
  > PR #1806: Fix undefined symbol: __android_log_write
  > Move ABSL_HAVE_PTHREAD_CPU_NUMBER_NP to the file where it is needed
  > Use rbit instruction on ARM rather than rev.
  > Debugging: Report the CPU we are running on under Darwin
  > Add a microbenchmark for very long int/string tuples.
  > Crc: Detect support for pmull and crc instructions on Apple AArch64 With a newer clang, we can use __builtin_cpu_supports which caches all the feature bits.
  > Add special handling for hashing integral types so that we can optimize Read1To3 and Read4To8 for the strings case.
  > Use unused FixedArray instances.
  > Minor reformatting
  > Avoid flaky expectation in WaitDurationWoken test case in MSVC.
  > Use Bazel rules_cc for many compiler-specific rules instead of our custom ones from before the Bazel rules existed.
  > Mix pointers twice in absl::Hash.
  > New internal-use-only classes `AsStructuredLiteralImpl` and `AsStructuredValueImpl`
  > Annotate some Abseil container methods with [[clang::lifetime_capture_by(...)]]
  > Faster copy from inline Cords to inline Strings
  > Add new benchmark cases for hashing string lengths 1,2,4,8.
  > Move the Arm implementation of UnscaledCycleClock::Now() into the header file, like the x86 implementation, so it can be more easily inlined.
  > Minor include cleanup in absl/random/internal
  > Import of CCTZ from GitHub.
  > Use Bazel Platforms to support AES-NI compile options for Randen
  > In HashState::Create, require that T is a subclass of HashStateBase in order to discourage users from defining their own HashState types.
  > PR #1801: Remove unncessary <iostream> includes
  > New class StructuredProtoField
  > Mix pointers twice in TSan and MSVC to avoid flakes in the PointerAlignment test.
  > Add a test case that type-erased absl::HashState is consistent with absl::HashOf.
  > Mix pointers twice in build modes in which the PointerAlignment test is flaky if we mix once.
  > Increase threshold for stuck bits in PointerAlignment test on android.
  > Use hashing ideas from Carbon's hashtable in absl hashing: - Use byte swap instead of mixing pointers twice. - Change order of branches to check for len<=8 first. - In len<=16 case, do one multiply to mix the data instead of using the logic from go/absl-hash-rl (reinforcement learning was used to optimize the instruction sequence). - Add special handling for len<=32 cases in 64-bit architectures.
  > Test that using a table that was moved-to from a moved-from table fails in sanitizer mode.
  > Remove a trailing comma causing an issue for an OSS user
  > Add missing includes in hash.h.
  > Use the public implementation rule for "@bazel_tools//tools/cpp:clang-cl"
  > Import of CCTZ from GitHub.
  > Change the definition of is_trivially_relocatable to be a bit less conservative.
  > Updates to CI to support newer versions of tools
  > Check if ABSL_HAVE_INTRINSIC_INT128 is defined
  > Print hash expansions in the hash_testing error messages.
  > Avoid flakiness in notification_test on MSVC.
  > Roll back: Add more debug capacity validation checks on moves.
  > Add more debug capacity validation checks on moves.
  > Add macro versions of nullability annotations.
  > Improve fork-safety by opening files with `O_CLOEXEC`.
  > Move ABSL_HARDENING_ASSERTs in constexpr methods to their own lines.
  > Add test cases for absl::Hash: - That hashes are consistent for the same int value across different int types. - That hashes of vectors of strings are unequal even when their concatenations are equal. - That FragmentedCord hashes works as intended for small Cords.
  > Skip the IterationOrderChangesOnRehash test case in ASan mode because it's flaky.
  > Add missing includes in absl hash.
  > Try to use file descriptors in the 2000+ range to avoid mis-behaving client interference.
  > Add weak implementation of the __lsan_is_turned_off in Leak Checker
  > Fix a bug where EOF resulted in infinite loop.
  > static_assert that absl::Time and absl::Duration are trivially destructible.
  > Move Duration ToInt64<unit> functions to be inline.
  > string_view: Add defaulted copy constructor and assignment
  > Use `#ifdef` to avoid errors when `-Wundef` is used.
  > Strip PAC bits for return addresses in stack traces
  > PR #1794: Update cpu_detect.cc fix hw crc32 and AES capability check, fix undefined
  > PR #1790: Respect the allocator's .destroy method in ~InlinedVector
  > Cast away nullability in the guts of CHECK_EQ (et al) where Clang doesn't see that the nullable string returned by Check_EQImpl is statically nonnull inside the loop.
  > string_view: Correct string_view(const char*, size_type) docs
  > Add support for std::string_view in StrCat even when absl::string_view != std::string_view.
  > Misc. adjustments to unit tests for logging.
  > Use local_config_cc from rules_cc and make it a dev dependency
  > Add additional iteration order tests with reservation. Reserved tables have a different way of iteration randomization compared to gradually resized tables (at least for small tables).
  > Use all the bits (`popcount`) in `FindFirstNonFullAfterResize` and `PrepareInsertAfterSoo`.
  > Mark ConsumePrefix, ConsumeSuffix, StripPrefix, and StripSuffix as constexpr since they are all pure functions.
  > PR #1789: Add missing #ifdef pp directive to the TypeName() function in the layout.h
  > PR #1788: Fix warning for sign-conversion on riscv
  > Make StartsWith and EndsWith constexpr.
  > Simplify logic for growing single group table.
  > Document that absl::Time and absl::Duration are trivially destructible.
  > Change some C-arrays to std::array as this enables bounds checking in some hardened standard library builds
  > Replace outdated select() on --cpu with platform API equivalent.
  > Take failure_message as const char* instead of string_view in LogMessageFatal and friends.
  > Mention `c_any_of` in the function comment of `absl::c_linear_search`.
  > Import of CCTZ from GitHub.
  > Rewrite some string_view methods to avoid a -Wunreachable-code warning
  > IWYU: Update includes and fix minor spelling mistakes.
  > Add comment on how to get next element after using erase.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND and a doc note about absl::LogAsLiteral to clarify its intended use.
  > Import of CCTZ from GitHub.
  > Reduce memory consumption of structured logging proto encoding by passing tag value
  > Remove usage of _LIBCPP_HAS_NO_FILESYSTEM_LIBRARY.
  > Make Span's relational operators constexpr since C++20.
  > distributions: support a zero max value in Zipf.
  > PR #1786: Fix typo in test case.
  > absl/random: run clang-format.
  > Add some nullability annotations in logging and tidy up some NOLINTs and comments.
  > CMake: Change the default for ABSL_PROPAGATE_CXX_STD to ON
  > Delete UnvalidatedMockingBitGen
  > PR #1783: [riscv][debugging] Fix a few warnings in RISC-V inlines
  > Add conversion operator to std::array for StrSplit.
  > Add a comment explaining the extra comparison in raw_hash_set::operator==. Also add a small optimization to avoid the extra comparison in sets that use hash_default_eq as the key_equal functor.
  > Add benchmark for absl::HexStringToBytes
  > Avoid installing options.h with the other headers
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::Span constructors.
  > Annotate absl::InlinedVector to warn when unused.
  > Make `c_find_first_of`'s `options` parameter a const reference to allow temporaries.
  > Disable Elf symbols for Xtensa
  > PR #1775: Support symbolize only on WINAPI_PARTITION_DESKTOP
  > Require through an internal presubmit that .h|.cc|.inc files contain either the string ABSL_NAMESPACE_BEGIN or SKIP_ABSL_INLINE_NAMESPACE_CHECK
  > Xtensa supports mmap, enable it in absl/base/config.h
  > PR #1777: Avoid std::ldexp in `operator double(int128)`.
  > Marks absl::Span as view and borrowed_range, like std::span.
  > Mark inline functions with only a simple comparison in strings/ascii.h as constexpr.
  > Add missing Abseil inline namespace and fix includes
  > Fix bug where the high bits of `__int128_t`/`__uint128_t` might go unused in the hash function. This fix increases the hash quality of these types.
  > Add a test to verify bit casting between signed and unsigned int128 works as expected
  > Add suggestions to enable sanitizers for asserts when doing so may be helpful.
  > Add nullability attributes to nullability type aliases.
  > Refactor swisstable moves.
  > Improve ABSL_ASSERT performance by guaranteeing it is optimized away under NDEBUG in C++20
  > Mark Abseil hardening assert in AssertSameContainer as slow.
  > Add workaround for q++ 8.3.0 (QNX 7.1) compiler by making sure MaskedPointer is trivially copyable and copy constructible.
  > Small Mutex::Unlock optimization
  > Optimize `CEscape` and `CEscapeAndAppend` by up to 40%.
  > Fix the conditional compilation of non_temporal_store_memcpy_avx to verify that AVX can be forced via `gnu::target`.
  > Delete TODOs to move functors when moving hashtables and add a test that fails when we do so.
  > Fix benchmarks in `escaping_benchmark.cc` by properly calling `benchmark::DoNotOptimize` on both inputs and outputs and by removing the unnecessary and wrong `ABSL_RAW_CHECK` condition (`check != 0`) of `BM_ByteStringFromAscii_Fail` benchmark.
  > It seems like commit abc9b916a94ebbf251f0934048295a07ecdbf32a did not work as intended.
  > Fix a bug in `absl::SetVLogLevel` where a less generic pattern incorrectly removed a more generic one.
  > Remove the side effects between tests in vlog_is_on_test.cc
  > Attempt to fix flaky Abseil waiter/sleep tests
  > Add an explicit tag for non-SOO CommonFields (removing default ctor) and add a small optimization for early return in AssertNotDebugCapacity.
  > Make moved-from swisstables behave the same as empty tables. Note that we may change this in the future.
  > Tag tests that currently fail on darwin_arm64 with "no_test_darwin_arm64"
  > add gmock to cmake defs for no_destructor_test
  > Optimize raw_hash_set moves by allowing some members of CommonFields to be uninitialized when moved-from.
  > Add more debug capacity validation checks on iteration/size.
  > Add more debug capacity validation checks on copies.
  > constinit -> constexpr for DisplayUnits
  > LSC: Fix null safety issues diagnosed by Clang’s `-Wnonnull` and `-Wnullability`.
  > Remove the extraneous variable creation in Match().
  > Import of CCTZ from GitHub.
  > Add more debug capacity validation checks on merge/swap.
  > Add `absl::` namespace to c_linear_search implementation in order to avoid ADL
  > Distinguish the debug message for the case of self-move-assigned swiss tables.
  > Update LowLevelHash comment regarding number of hash state variables.
  > Add an example for the `--vmodule` flag.
  > Remove first prefetch.
  > Add moved-from validation for the case of self-move-assignment.
  > Allow slow and fast abseil hardening checks to be enabled independently.
  > Update `ABSL_RETIRED_FLAG` comment to reflect `default_value` is no longer used.
  > Add validation against use of moved-from hash tables.
  > Provide file-scoped pragma behind macro ABSL_POINTERS_DEFAULT_NONNULL to indicate the default nullability. This is a no-op for now (not understood by checkers), but does communicate intention to human readers.
  > Add stacktrace config for android using the generic implementation
  > Fix nullability annotations in ABSL code.
  > Replace CHECKs with ASSERTs and EXPECTs -- no reason to crash on failure.
  > Remove ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW
  > Migrate ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW to ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW
  > Disable ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW prior to Clang-13 due to false positives.
  > Make ABSL_ATTRIBUTE_VIEW and ABSL_ATTRIBUTE_OWNER public
  > Optimize raw_hash_set::AssertHashEqConsistent a bit to avoid having as much runtime overhead.
  > PR #1728: Workaround broken compilation against NDK r25
  > Add validation against use of destroyed hash tables.
  > Do not truncate `ABSL_RAW_LOG` output at null bytes
  > Use several unused cord instances in tests and benchmarks.
  > Add comments about ThreadIdentity struct allocation behavior.
  > Refactoring followup for reentrancy validation in swisstable.
  > Add debug mode checks that element constructors/destructors don't make reentrant calls to raw_hash_set member functions.
  > Add tagging for cc_tests that are incompatible with Fuchsia
  > Add GetTID() implementation for Fuchsia
  > PR #1738: Fix shell option group handling in pkgconfig files
  > Disable weak attribute when absl compiled as windows DLL
  > Remove `CharIterator::operator->`.
  > Mark non-modifying container algorithms as constexpr for C++20.
  > PR #1739: container/internal: Explicitly include <cstdint>
  > Don't match -Wnon-virtual-dtor in the "flags are needed to suppress warnings in headers". It should fall through to the "don't impose our warnings on others" case. Do this by matching on "-Wno-*" instead of "-Wno*".
  > PR #1732: Fix build on NVIDIA Jetson board. Fix #1665
  > Update GoogleTest dependency to 1.15.2
  > Enable AsciiStrToLower and AsciiStrToUpper overloads for rvalue references.
  > PR #1735: Avoid `int` to `bool` conversion warning
  > Add `absl::swap` functions for `*_hash_*` to avoid calling `std::swap`
  > Change internal visibility
  > Remove resolved issue.
  > Increase test timeouts to support running on Fuchsia emulators
  > Add tracing annotations to absl::Notification
  > Suppress compiler optimizations which may break container poisoning.
  > Disable ABSL_INTERNAL_HAVE_DEBUGGING_STACK_CONSUMPTION for Fuchsia
  > Add tracing annotations to absl::BlockingCounter
  > Add absl_vlog_is_on and vlog_is_on to ABSL_INTERNAL_DLL_TARGETS
  > Update swisstable swap API comments to no longer guarantee that we don't move/swap individual elements.
  > PR #1726: cmake: Fix RUNPATH when using BUILD_WITH_INSTALL_RPATH=True
  > Avoid unnecessary copying when upper-casing or lower-casing ASCII string_view
  > Add weak internal tracing API
  > Fix LINT.IfChange syntax
  > PR #1720: Fix spelling mistake: occurrance -> occurrence
  > Add missing include for Windows ASAN configuration in poison.cc
  > Delete absl/strings/internal/has_absl_stringify.h now that the GoogleTest version we depend on uses the public file
  > Update versions of dependencies in preparation for release
  > PR #1699: Add option to build with MSVC static runtime
  > Remove unneeded 'be' from comment.
  > PR #1715: Generate options.h using CMake only once
  > Small type fix in absl/log/internal/log_impl.h
  > PR #1709: Handle RPATH CMake configuration
  > PR #1710: fixup! PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
  > PR #1695: Fix time library build for Apple platforms
  > Remove cyclic cmake dependency that breaks in cmake 3.30.0
  > Roll forward poisoned pointer API and fix portability issues.
  > Use GetStatus in IsOkAndHoldsMatcher
  > PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
  > PR #1706: Require CMake version 3.16
  > Add an MSVC implementation of ABSL_ATTRIBUTE_LIFETIME_BOUND
  > Mark c_min_element, c_max_element, and c_minmax_element as constexpr in C++17.
  > Optimize the absl::GetFlag cost for most non built-in flag types (including string).
  > Encode some additional metadata when writing protobuf-encoded logs.
  > Replace signed integer overflow, since that's undefined behavior, with unsigned integer overflow.
  > Make mutable CompressedTuple::get() constexpr.
  > vdso_support: support DT_GNU_HASH
  > Make c_begin, c_end, and c_distance conditionally constexpr.
  > Add operator<=> comparison to absl::Time and absl::Duration.
  > Deprecate `ABSL_ATTRIBUTE_NORETURN` in favor of the `[[noreturn]]` standardized in C++11
  > Rollback new poisoned pointer API
  > Static cast instead of reinterpret cast raw hash set slots as casting from void* to T* is well defined
  > Fix absl::NoDestructor documentation about its use as a global
  > Declare Rust demangling feature-complete.
  > Split demangle_internal into a tree of smaller libraries.
  > Decode Rust Punycode when it's not too long.
  > Add assertions to detect reentrance in `IterateOverFullSlots` and `absl::erase_if`.
  > Decoder for Rust-style Punycode encodings of bounded length.
  > Add `c_contains()` and `c_contains_subrange()` to `absl/algorithm/container.h`.
  > Three-way comparison spaceship <=> operators for Cord.
  > internal-only change
  > Remove erroneous preprocessor branch on SGX_SIM.
  > Add an internal API to get a poisoned pointer.
  > optimization.h: Add missing <utility> header for C++
  > Add a compile test for headers that require C compatibility
  > Fix comment typo
  > Expand documentation for SetGlobalVLogLevel and SetVLogLevel.
  > Roll back 6f972e239f668fa29cab43d7968692cd285997a9
  > PR #1692: Add missing `<utility>` include
  > Remove NOLINT for `#include <new>` for __cpp_lib_launder
  > Remove not used after all kAllowRemoveReentrance parameter from IterateOverFullSlots.
  > Create `absl::container_internal::c_for_each_fast` for SwissTable.
  > Disable flaky test cases in kernel_timeout_internal_test.
  > Document that swisstable and b-tree containers are not exception-safe.
  > Add `ABSL_NULLABILITY_COMPATIBLE` attribute.
  > LSC: Move expensive variables on their last use to avoid copies.
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to more types in Abseil
  > Drop std:: qualification from integer types like uint64_t.
  > Increase slop time on MSVC in PerThreadSemTest.Timeouts again due to continued flakiness.
  > Turn on validation for out of bounds MockUniform in MockingBitGen
  > Use ABSL_UNREACHABLE() instead of equivalent
  > If so configured, report which part of a C++ mangled name didn't parse.
  > Sequence of 1-to-4 values with prefix sum to support Punycode decoding.
  > Add the missing inline namespace to the nullability files
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to types in Abseil
  > Disallow reentrance removal in `absl::erase_if`.
  > Fix implicit conversion of temporary bitgen to BitGenRef
  > Use `IterateOverFullSlots` in `absl::erase_if` for hash table.
  > UTF-8 encoding library to support Rust Punycode decoding.
  > Disable negative NaN float ostream format checking on RISC-V
  > PR #1689: Minor: Add missing quotes in CMake string view library definition
  > Demangle template parameter object names, TA <template-arg>.
  > Demangle sr St <simple-id> <simple-id>, a dubious encoding found in the wild.
  > Try not to lose easy type combinators in S::operator const int*() and the like.
  > Demangle fixed-width floating-point types, DF....
  > Demangle _BitInt types DB..., DU....
  > Demangle complex floating-point literals.
  > Demangle <extended-qualifier> in types, e.g., U5AS128 for address_space(128).
  > Demangle operator co_await (aw).
  > Demangle fully general vendor extended types (any <template-args>).
  > Demangle transaction-safety notations GTt and Dx.
  > Demangle C++11 user-defined literal operator functions.
  > Demangle C++20 constrained friend names, F (<source-name> | <operator-name>).
  > Demangle dependent GNU vector extension types, Dv <expression> _ <type>.
  > Demangle elaborated type names, (Ts | Tu | Te) <name>.
  > Add validation that hash/eq functors are consistent, meaning that `eq(k1, k2) -> hash(k1) == hash(k2)`.
  > Demangle delete-expressions with the global-scope operator, gs (dl | da) ....
  > Demangle new-expressions with braced-init-lists.
  > Demangle array new-expressions, [gs] na ....
  > Demangle object new-expressions, [gs] nw ....
  > Demangle preincrement and predecrement, pp_... and mm_....
  > Demangle throw and rethrow (tw... and tr).
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Demangle ti... and te... expressions (typeid).
  > Demangle nx... syntax for noexcept(e) as an expression in a dependent signature.
  > Demangle alignof expressions, at... and az....
  > Demangle C++17 structured bindings, DC...E.
  > Demangle modern _ZGR..._ symbols.
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Demangle sizeof...(pack captured from an alias template), sP ... E.
  > Demangle types nested under vendor extended types.
  > Demangle il ... E syntax (braced list other than direct-list-initialization).
  > Avoid signed overflow for Ed <number> _ manglings with large <number>s.
  > Remove redundant check of is_soo() while prefetching heap blocks.
  > Remove obsolete TODO
  > Clarify function comment for `erase` by stating that this idiom only works for "some" standard containers.
  > Move SOVERSION to global CMakeLists, apply SOVERSION to DLL
  > Set ABSL_HAVE_THREAD_LOCAL to 1 on all platforms
  > Demangle constrained auto types (Dk <type-constraint>).
  > Parse <discriminator> more accurately.
  > Demangle lambdas in class member functions' default arguments.
  > Demangle unofficial <unresolved-qualifier-level> encodings like S0_IT_E.
  > Do not make std::filesystem::path hash available for macOS <10.15
  > Include flags in DLL build (non-Windows only)
  > Enable building monolithic shared library on macOS and Linux.
  > Demangle Clang's last-resort notation _SUBSTPACK_.
  > Demangle C++ requires-expressions with parameters (rQ ... E).
  > Demangle Clang's encoding of __attribute__((enable_if(condition, "message"))).
  > Demangle static_cast and friends.
  > Demangle decltype(expr)::nested_type (NDT...E).
  > Optimize GrowIntoSingleGroupShuffleControlBytes.
  > Demangle C++17 fold-expressions.
  > Demangle thread_local helper functions.
  > Demangle lambdas with explicit template arguments (UlTy and similar forms).
  > Demangle &-qualified function types.
  > Demangle valueless literals LDnE (nullptr) and LA<number>_<type>E ("foo").
  > Correctly demangle the <unresolved-name> at the end of dt and pt (x.y, x->y).
  > Add missing targets to ABSL_INTERNAL_DLL_TARGETS
  > Build abseil_test_dll with ABSL_BUILD_TESTING
  > Demangle C++ requires-expressions without parameters (rq ... E).
  > overload: make the constructor constexpr
  > Update Abseil CI Docker image to use Clang 19, GCC 14, and CMake 3.29.3
  > Workaround symbol resolution bug in Clang 19
  > Workaround bogus GCC14 -Wmaybe-uninitialized warning
  > Silence a bogus GCC14 -Warray-bounds warning
  > Forbid absl::Uniform<absl::int128>(gen)
  > Use IN_LIST to replace list(FIND) + > -1
  > Recognize C++ vendor extended expressions (e.g., u9__is_same...E).
  > `overload_test`: Remove a few unnecessary trailing return types
  > Demangle the C++ this pointer (fpT).
  > Stop eating an extra E in ParseTemplateArg for some L<type><value>E literals.
  > Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to Abseil.
  > Demangle C++ direct-list-initialization (T{1, 2, 3}, tl ... E).
  > Demangle the C++ spaceship operator (ss, operator<=>).
  > Demangle C++ sZ encodings (sizeof...(pack)).
  > Demangle C++ so ... E encodings (typically array-to-pointer decay).
  > Recognize dyn-trait-type in Rust demangling.
  > Rework casting in raw_hash_set's IsFull().
  > Remove test references to absl::SharedBitGen, which was never part of the open source release. This was only used in tests that never ran as part in the open source release.
  > Recognize fn-type and lifetimes in Rust demangling.
  > Support int128/uint128 in validated MockingBitGen
  > Recognize inherent-impl and trait-impl in Rust demangling.
  > Recognize const and array-type in Rust mangled names.
  > Remove Asylo from absl.
  > Recognize generic arguments containing only types in Rust mangled names.
  > Fix missing #include <random> for std::uniform_int_distribution
  > Move `prepare_insert` out of the line as type erased `PrepareInsertNonSoo`.
  > Revert: Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
  > Add (unused) validation to absl::MockingBitGen
  > Support `AbslStringify` with `DCHECK_EQ`.
  > PR #1672: Optimize StrJoin with tuple without user defined formatter
  > Give ReturnAddresses and N<uppercase> namespaces separate stacks for clarity.
  > Demangle Rust backrefs.
  > Use Nt for struct and trait names in Rust demangler test inputs.
  > Allow __cxa_demangle on MIPS
  > Add a `string_view` overload to `absl::StrJoin`
  > Demangle Rust's Y<type><path> production for passably simple <type>s.
  > `convert_test`: Delete obsolete condition around ASSERT_EQ in TestWithMultipleFormatsHelper
  > `any_invocable`: Clean up #includes
  > Resynchronize absl/functional/CMakeLists.txt with BUILD.bazel
  > `any_invocable`: Add public documentation for undefined behavior when invoking an empty AnyInvocable
  > `any_invocable`: Delete obsolete reference to proposed standard type
  > PR #1662: Replace shift with addition in crc multiply
  > Doc fix.
  > `convert_test`: Extract loop over tested floats from helper function
  > Recognize some simple Rust mangled names in Demangle.
  > Use __builtin_ctzg and __builtin_clzg in the implementations of CountTrailingZeroesNonzero16 and CountLeadingZeroes16 when they are available.
  > Remove the forked absl::Status matchers implementation in statusor_test
  > Add comment hack to fix copybara reversibility
  > Add GoogleTest matchers for absl::Status
  > [random] LogUniform: Document as a discrete distribution
  > Enable Cord tests with Crc.
  > Fix order of qualifiers in `absl::AnyInvocable` documentation.
  > Guard against null pointer dereference in DumpNode.
  > Apply ABSL_MUST_USE_RESULT to try lock functions.
  > Add public aliases for default hash/eq types in hash-based containers
  > Import of CCTZ from GitHub.
  > Remove the hand-rolled CordLeaker and replace with absl::NoDestructor to test the after-exit behavior
  > `convert_test`: Delete obsolete `skip_verify` parameter in test helper
  > overload: allow using the underlying type with CTAD directly.
  > PR #1653: Remove unnecessary casts when calling CRC32_u64
  > PR #1652: Avoid C++23 deprecation warnings from float_denorm_style
  > Minor cleanup for `absl::Cord`
  > PR #1651: Implement ABSL_INTERNAL_DISABLE_DEPRECATED_DECLARATION_WARNING for MSVC compiler
  > Add `operator<=>` support to `absl::int128` and `absl::uint128`
  > [absl] Re-use the existing `std::type_identity` backfill instead of redefining it again
  > Add `absl::AppendCordToString`
  > `str_format/convert_test`: Delete workaround for [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=22142)
  > `absl/log/internal`: Document conditional ABSL_ATTRIBUTE_UNUSED, add C++17 TODO
  > `log/internal/check_op`: Add ABSL_ATTRIBUTE_UNUSED to CHECK macros when STRIP_LOG is enabled
  > log_benchmark: Add VLOG_IS_ON benchmark
  > Restore string_view detection check
  > Remove an unnecessary ABSL_ATTRIBUTE_UNUSED from a logging macro
  < Abseil LTS Branch, Jan 2024, Patch 2 (#1650)
  > In example code, add missing template parameter.
  > Optimize crc32 V128_From2x64 on Arm
  > Annotate that Mutex should warn when unused.
  > Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Cord::Flatten/TryFlat
  > Deprecate `absl::exchange`, `absl::forward` and `absl::move`, which were only useful before C++14.
  > Temporarily revert dangling std::string_view detection until dependent is fixed
  > Use _decimal_ literals for the CivilDay example.
  > Fix bug in BM_EraseIf.
  > Add internal traits to absl::string_view for lifetimebound detection
  > Add internal traits to absl::StatusOr for lifetimebound detection
  > Add internal traits to absl::Span for lifetimebound detection
  > Add missing dependency for log test build target
  > Add internal traits for lifetimebound detection
  > Use local decoding buffer in HexStringToBytes
  > Only check if the frame pointer is inside a signal stack with known bounds
  > Roll forward: enable small object optimization in swisstable.
  > Optimize LowLevelHash by breaking dependency between final loads and previous len/ptr updates.
  > Fix the wrong link.
  > Optimize InsertMiss for tables without kDeleted slots.
  > Use GrowthInfo without applying any optimizations based on it.
  > Disable small object optimization while debugging some failing tests.
  > Adjust conditonal compilation in non_temporal_memcpy.h
  > Reformat log/internal/BUILD
  > Remove deprecated errno constants from the absl::Status mapping
  > Introduce GrowthInfo with tests, but without usage.
  > Enable small object optimization in swisstable.
  > Refactor the GCC unintialized memory warning suppression in raw_hash_set.h.
  > Respect `NDEBUG_SANITIZER`
  > Revert integer-to-string conversion optimizations pending more thorough analysis
  > Fix a bug in `Cord::{Append,Prepend}(CordBuffer)`: call `MaybeRemoveEmptyCrcNode()`. Otherwise appending a `CordBuffer` an empty Cord with a CRC node crashes (`RemoveCrcNode()` which increases the refcount of a nullptr child).
  > Add `BM_EraseIf` benchmark.
  > Record sizeof(key_type), sizeof(value_type) in hashtable profiles.
  > Fix ClangTidy warnings in btree.h.
  > LSC: Move expensive variables on their last use to avoid copies.
  > PR #1644: unscaledcycleclock: remove RISC-V support
  > Reland: Make DLOG(FATAL) not understood as [[noreturn]]
  > Separate out absl::StatusOr constraints into statusor_internal.h
  > Use Layout::WithStaticSizes in btree.
  > `layout`: Delete outdated comments about ElementType alias not being used because of MSVC
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > `layout_benchmark`: Replace leftover comment with intended call to MyAlign
  > Remove absl::aligned_storage_t
  > Delete ABSL_ANNOTATE_MEMORY_IS_INITIALIZED under Thread Sanitizer
  > Remove vestigial variables in the DumpNode() helper in absl::Cord
  > Do hashtablez sampling on the first insertion into an empty SOO hashtable.
  > Add explicit #include directives for <tuple>, "absl/base/config.h", and "absl/strings/string_view.h".
  > Add a note about the cost of `VLOG` in non-debug builds.
  > Fix flaky test failures on MSVC.
  > Add template keyword to example comment for Layout::WithStaticSizes.
  > PR #1643: add xcprivacy to all subspecs
  > Record sampling stride in cord profiling to facilitate unsampling.
  > Fix a typo in a comment.
  > [log] Correct SetVLOGLevel to SetVLogLevel in comments
  > Add a feature to container_internal::Layout that lets you specify some array sizes at compile-time as template parameters. This can make offset and size calculations faster.
  > `layout`: Mark parameter of Slices with ABSL_ATTRIBUTE_UNUSED, remove old workaround
  > `layout`: Use auto return type for functions that explicitly instantiate std::tuple in return statements
  > Remove redundant semicolons introduced by macros
  > [log] Make :vlog_is_on/:absl_vlog_is_on public in BUILD.bazel
  > Add additional checks for size_t overflows
  > Replace //visibility:private with :__pkg__ for certain targets
  > PR #1603: Disable -Wnon-virtual-dtor warning for CommandLineFlag implementations
  > Add several missing includes in crc/internal
  > Roll back extern template instatiations in swisstable due to binary size increases in shared libraries.
  > Add nodiscard to SpinLockHolder.
  > Test that rehash(0) reduces capacity to minimum.
  > Add extern templates for common swisstable types.
  > Disable ubsan for benign unaligned access in crc_memcpy
  > Make swisstable SOO support GDB pretty printing and still compile in OSS.
  > Fix OSX support with CocoaPods and Xcode 15
  > Fix GCC7 C++17 build
  > Use UnixEpoch and ZeroDuration
  > Make flaky failures much less likely in BasicMocking.MocksNotTriggeredForIncorrectTypes test.
  > Delete a stray comment
  > Move GCC uninitialized memory warning suppression into MaybeInitializedPtr.
  > Replace usages of absl::move, absl::forward, and absl::exchange with their std:: equivalents
  > Fix the move to itself
  > Work around an implicit conversion signedness compiler warning
  > Avoid MSan: use-of-uninitialized-value error in find_non_soo.
  > Fix flaky MSVC test failures by using longer slop time.
  > Add ABSL_ATTRIBUTE_UNUSED to variables used in an ABSL_ASSUME.
  > Implement small object optimization in swisstable - disabled for now.
  > Document and test ability to use absl::Overload with generic lambdas.
  > Extract `InsertPosition` function to be able to reuse it.
  > Increase GraphCycles::PointerMap size
  > PR #1632: inlined_vector: Use trivial relocation for `erase`
  > Create `BM_GroupPortable_Match`.
  > [absl] Mark `absl::NoDestructor` methods with `absl::Nonnull` as appropriate
  > Automated Code Change
  > Rework casting in raw_hash_set's `IsFull()`.
  > Adds ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::BitGenRef
  > Workaround for NVIDIA C++ compiler being unable to parse variadic expansions in range of range-based for loop
  > Rollback: Make DLOG(FATAL) not understood as [[noreturn]]
  > Make DLOG(FATAL) not understood as [[noreturn]]
  > Optimize `absl::Duration` division and modulo: Avoid repeated redundant comparisons in `IDivFastPath`.
  > Optimize `absl::Duration` division and modulo: Allow the compiler to inline `time_internal::IDivDuration`, by splitting the slow path to a separate function.
  > Fix typo in example code snippet.
  > Automated Code Change
  > Add braces for conditional statements in raw_hash_map functions.
  > Optimize `prepare_insert`, when resize happens. It removes single unnecessary probing before resize that is beneficial for small tables the most.
  > Add noexcept to move assignment operator and swap function
  > Import of CCTZ from GitHub.
  > Minor documentation updates.
  > Change find_or_prepare_insert to return std::pair<iterator, bool> to match return type of insert.
  > PR #1618: inlined_vector: Use trivial relocation for `SwapInlinedElements`
  > Improve raw_hash_set tests.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Use const_cast to avoid duplicating the implementation of raw_hash_set::find(key).
  > Import of CCTZ from GitHub.
  > Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
  > Annotate that SpinLock should warn when unused.
  > PR #1625: absl::is_trivially_relocatable now respects assignment operators
  > Introduce `Group::MaskNonFull` without usage.
  > `demangle`: Parse template template and C++20 lambda template param substitutions
  > PR #1617: fix MSVC 32-bit build with -arch:AVX
  > Minor documentation fix for `absl::StrSplit()`
  > Prevent overflow in `absl::CEscape()`
  > `demangle`: Parse optional single template argument for built-in types
  > PR #1412: Filter out `-Xarch_` flags from pkg-config files
  > `demangle`: Add complexity guard to `ParseQRequiresExpr`
  < Prepare 20240116.1 patch for Apple Privacy Manifest (#1623)
  > Remove deprecated symbol absl::kuint128max
  > Add ABSL_ATTRIBUTE_WARN_UNUSED.
  > `demangle`: Parse `requires` clauses on template params, before function return type
  > On Apple, implement absl::is_trivially_relocatable with the fallback.
  > `demangle`: Parse `requires` clauses on functions
  > Make `begin()` to return `end()` on empty tables.
  > `demangle`: Parse C++20-compatible template param declarations, except those with `requires` expressions
  > Add the ABSL_DEPRECATE_AND_INLINE() macro
  > Span: Fixed comment referencing std::span as_writable_bytes() as as_mutable_bytes().
  > Switch rank structs to be consistent with written guidance in go/ranked-overloads
  > Avoid hash computation and `Group::Match` in small tables copy and use `IterateOverFullSlots` for iterating for all tables.
  > Optimize `absl::Hash` by making `LowLevelHash` faster.
  > Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
  < Backport Apple Privacy Manifest (#1613)
  > Stop using `std::basic_string<uint8_t>` which relies on a non-standard generic `char_traits<>` implementation, recently removed from `libc++`.
  > Add absl_container_hash-based HashEq specialization
  > `demangle`: Implement parsing for simplest constrained template arguments
  > Roll forward 9d8588bfc4566531c4053b5001e2952308255f44 (which was rolled back in 146169f9ad357635b9cd988f976b38bcf83476e3) with fix.
  > Add a version of absl::HexStringToBytes() that returns a bool to validate that the input was actually valid hexadecimal data.
  > Enable StringLikeTest in hash_function_defaults_test
  > Fix a typo.
  > Minor changes to the BUILD file for absl/synchronization
  > Avoid static initializers in case of ABSL_FLAGS_STRIP_NAMES=1
  > Rollback 9d8588bfc4566531c4053b5001e2952308255f44 for breaking the build
  > No public description
  > Decrease the precision of absl::Now in x86-64 debug builds
  > Optimize raw_hash_set destructor.
  > Add ABSL_ATTRIBUTE_UNINITIALIZED macros for use with clang and GCC's `uninitialized`
  > Optimize `Cord::Swap()` for missed compiler optimization in clang.
  > Type erased hash_slot_fn that depends only on key types (and hash function).
  > Replace `testonly = 1` with `testonly = True` in abseil BUILD files.
  > Avoid extra `& msbs` on every iteration over the mask for GroupPortableImpl.
  > Missing parenthesis.
  > Early return from destroy_slots for trivially destructible types in flat_hash_{*}.
  > Avoid export of testonly target absl::test_allocator in CMake builds
  > Use absl::NoDestructor for cordz global queue.
  > Add empty WORKSPACE.bzlmod
  > Introduce `RawHashSetLayout` helper class.
  > Fix a corner case in SpyHashState for exact boundaries.
  > Add nullability annotations
  > Use absl::NoDestructor for global HashtablezSampler.
  > Always check if the new frame pointer is readable.
  > PR #1604: Add privacy manifest
  < Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds  (#1606)
  > Remove code pieces for no longer supported GCC versions.
  > Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds
  > Prevent brace initialization of AlphaNum
  > Remove code pieces for no longer supported MSVC versions.
  > Added benchmarks for smaller size copy constructors.
  > Migrate empty CrcCordState to absl::NoDestructor.
  > Add protected copy ctor+assign to absl::LogSink, and clarify thread-safety requirements to apply to the interface methods.
  < Apply LTS transformations for 20240116 LTS branch (#1599)

Closes scylladb/scylladb#28756
2026-04-08 12:19:54 +03:00
Liapkovich
4f17cc6d83 docs: add missing rack value for internode_compression parameter
The rack option was fully implemented in the code but omitted from
both docs/operating-scylla/admin.rst and conf/scylla.yaml comments.

Closes scylladb/scylladb#29239
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
0ea76a468f schema: Avoid copies in column_mapping::operator==
In a multi-declarator declaration, the & ref-qualifier is part of each
individual declarator, not the shared type specifier. So:

    const auto& a = x(), b = y();

declares 'a' as a reference but 'b' as a value, silently copying y().
The same applies to:

    const T& a = v[i], b = v[j];

Both operator== lines had this pattern, causing an unnecessary copy of
the column vector and an unnecessary copy of each entry on every call.

Fix by repeating & on the second declarator in both lines.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29213
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
b7c14c6d29 token_metadata: Clear _topology_change_info gently
clear_gently() (introduced in 322aa2f8b5) clears all token_metadata_impl
members using co_await to avoid reactor stalls on large data structures.

_topology_change_info (introduced in 10bf8c7901) was added later and not
included in clear_gently().

update_topology_change_info() already uses utils::clear_gently() when
replacing the value, so it looks reasonable to apply the same pattern
in clear_gently().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29210
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
54fbbf0410 locator/tablets: Fix missing selector value in error messages
Some on_internal_error() calls have the selector argument to a format
string with no placeholder for it in the format string.

"While at it", disambiguate selector type in the message text.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29208
2026-04-08 12:19:54 +03:00
Botond Dénes
418141ec08 Merge 'Drop create_dataset() helper from object_store tests' from Pavel Emelyanov
There's only one test left that uses it, and it can be patched to use standard ks/cf creation helpers from pylib. This patch does so and drops the lengthy create_dataset() helper

Tests improvements, no need to backport

Closes scylladb/scylladb#29176

* github.com:scylladb/scylladb:
  test/backup: drop create_dataset helper
  test/backup: use new_test_keyspace in test_restore_primary_replica
2026-04-08 12:19:54 +03:00
Petr Gusev
1e3c8c5a87 test_mutation_schema_change: use tablets
The enable_tablets(false) was added when LWT wasn't supported for tablets, now it's, so no need in this attribute are more.

The test covers behavior which should work in similar way for both vnodes and tablets -> it doesn't seem it would benefit much from running it in both enable_tablets(true) and enable_tablets(false) modes.

Closes scylladb/scylladb#29167
2026-04-08 12:19:54 +03:00
Pavel Emelyanov
7f854c0255 hints: Use shorter fault-injection overload
In order to apply fsult-injected delay, there's the inject(duration)
overload. Results in shorter code

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29168
2026-04-08 10:51:37 +03:00
Botond Dénes
aeefbda304 Merge 'Simplify and improve API descibe_ring code flow' from Pavel Emelyanov
The endpoint in question has some places worth fixing, in particular

- the keyspace parameter is not validated
- the validated table name is resolved into table_id, but the id is unused
- two ugly static helpers to stream obtained token ranges into json

Improving the API code flow, not backporting

Closes scylladb/scylladb#29154

* github.com:scylladb/scylladb:
  api: Inline describe_ring JSON handling
  storage_service: Make describe_ring_for_table() take table_id
2026-04-08 10:50:07 +03:00
Artsiom Mishuta
b1e9c0b867 test/pylib: add typed skip markers plugin
Add skip_reason_plugin.py — a framework-agnostic pytest plugin that
provides typed skip markers (skip_bug, skip_not_implemented, skip_slow,
skip_env) so that the reason a test is skipped is machine-readable in
JUnit XML and Allure reports.  Bare untyped pytest.mark.skip now
triggers a warning (to become an error after full migration).  Runtime
skips via skip() are also enriched by parsing the [type] prefix from
the skip message.

The plugin is a class (SkipReasonPlugin) that receives the concrete
SkipType enum and an optional report_callback from conftest.py, keeping
it decoupled from allure and project-specific types.

Extract SkipType enum and convenience runtime skip wrappers (skip_bug,
skip_env, etc.) into test/pylib/skip_types.py so callers only need a
single import instead of importing both SkipType and skip() separately.
conftest.py imports SkipType from the new module and registers the
plugin instance unconditionally (for all test runners).

New files:
- test/pylib/skip_reason_plugin.py: core plugin — typed marker
  processing, bare-skip warnings, JUnit/Allure report enrichment
  (including runtime skip() parsing via _parse_skip_type helper)
- test/pylib/skip_types.py: SkipType enum and convenience wrappers
  (skip_bug, skip_not_implemented, skip_slow, skip_env)
- test/pylib_test/test_skip_reason_plugin.py: 17 pytester-based
  test functions (51 cases across 3 build modes) covering markers,
  warnings, reports, callbacks, and skip_mode interaction

Infrastructure changes:
- test/conftest.py: import SkipType from skip_types, register
  SkipReasonPlugin with allure report callback
- test/pylib/runner.py: set SKIP_TYPE_KEY/SKIP_REASON_KEY stash keys
  for skip_mode so the report hook can enrich JUnit/Allure with
  skip_type=mode without longrepr parsing
- test/pytest.ini: register typed marker definitions (required for
  --strict-markers even when plugin is not loaded)

Migrated test files (representative samples):
- test/cluster/test_tablet_repair_scheduler.py:
  skip -> skip_bug (#26844), skip -> skip_not_implemented
- test/cqlpy/.../timestamp_test.py: skip -> skip_slow
- test/cluster/dtest/schema_management_test.py: skip -> skip_not_implemented
- test/cluster/test_change_replication_factor_1_to_0.py: skip -> skip_bug (#20282)
- test/alternator/conftest.py: skip -> skip_env
- test/alternator/test_https.py: use skip_env() wrapper

Fixes SCYLLADB-79

Closes scylladb/scylladb#29235
2026-04-08 10:38:56 +03:00
Pavel Emelyanov
e0fa9ee332 Merge 'storage: implement sstable clone for object storage' from Ernest Zaslavsky
This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction.

The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed).

The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path:
- GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body.
- S3 `copy_object` silently ignored the abort_source parameter.

1. **gcp_client: fix copy_object request method and body** — Fix two bugs in the GCS rewrite API call.
2. **s3_client: pass through abort_source in copy_object** — Stop ignoring the abort_source parameter.
3. **object_storage: add copy_object to object_storage_client** — New interface method with S3 and GCS implementations.
4. **storage: add make_object_name overload with generation** — Helper for building destination object names with a different generation.
5. **storage: make delete_object const** — Needed by the const clone method.
6. **storage: implement object_storage_base::clone** — The actual clone implementation plus a copy_object wrapper.
7. **test/boost: enable sstable clone tests for S3 and GCS** — Re-enable the previously skipped tests.

A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both  S3 and GCS.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045
Prerequisite: https://github.com/scylladb/scylladb/pull/28790
No need to backport since this code targets future feature

Closes scylladb/scylladb#29166

* github.com:scylladb/scylladb:
  compaction_test: enable sstable clone tests for S3 and GCS
  storage: implement object_storage_base::clone
  storage: make delete_object const in object_storage_base
  storage: add make_object_name overload with generation
  sstables: add get_format() accessor to sstable
  object_storage: add copy_object to object_storage_client
  s3_client: pass through abort_source in copy_object
  gcp_client: fix copy_object request method and body
2026-04-08 09:35:10 +03:00
Nadav Har'El
4eeb9f4120 lwt, vector: write to CDC when vector index is enabled.
The vector-search feature introduced the somewhat confusing feature of
enabling CDC without explicitly enabling CDC: When a vector index is
enabled on a table, CDC is "enabled" for it even if the user didn't
ask to enable CDC.

For this, write-path code began to use a new cdc_enabled() function
instead of checking schema.cdc_options.enabled() directly.  This
cdc_enabled() function checks if either this enabled() is true, or
has_vector_index() is true.

Unfortunately, LWT writes continued to use cdc_options.enabled() instead
of the new cdc_enabled(). This means that if a vector index is used and
a vector is written using an LWT write, the new value is not indexed.

This patch fixes this bug. It also adds a regression test that fails
before this patch and passes afterwards - the new test verifies that
when a table has a vector index (but no explicit CDC enabled), the CDC
log is updated both after regular writes and after successful LWT writes.

This patch was also tested in the context of the upcoming vector-search-
for-Alternator pull request, which has a test reproducing this bug
(Alternator uses LWT frequently, so this is very important there).
It will also be tested by the vector-store test suite ("validator").

Fixes SCYLLADB-1342

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29300
2026-04-08 07:55:05 +03:00
Marcin Maliszkiewicz
1bf3110adb Merge 'test: add test_upgrade_preserves_ddl_audit_for_tables' from Andrzej Jackowski
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations ([SCYLLADB-1155](https://scylladb.atlassian.net/browse/SCYLLADB-1155)).

Test time in dev: 4s

Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
No backport, test for bug on master

[SCYLLADB-1155]: https://scylladb.atlassian.net/browse/SCYLLADB-1155?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29223

* github.com:scylladb/scylladb:
  test: add test_upgrade_preserves_ddl_audit_for_tables
  test: audit: split validate helper so callers need not pass audit_settings
  test: audit: declare manager attribute in AuditTester base class
2026-04-07 17:29:11 +02:00
Marcin Maliszkiewicz
895fdb6d29 Merge 'ldap: fix double-free of LDAPMessage in poll_results()' from Andrzej Jackowski
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.

Fixes: SCYLLADB-1344

Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere.

Closes scylladb/scylladb#29302

* github.com:scylladb/scylladb:
  test: ldap: add regression test for double-free on unregistered message ID
  ldap: fix double-free of LDAPMessage in poll_results()
2026-04-07 17:27:43 +02:00
Ernest Zaslavsky
422f107122 compaction_test: enable sstable clone tests for S3 and GCS
Now that object_storage_base::clone is implemented,
remove the early-return skips and re-enable the
sstable_clone_leaving_unsealed_dest_sstable tests for
both S3 and GCS storage backends.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
7cd9bbb010 storage: implement object_storage_base::clone
Implement the clone method for object_storage_base, which creates
a copy of an sstable with a new generation using server-side object
copies. Also add a const copy_object convenience wrapper, similar
to the existing put_object and delete_object wrappers.

A dedicated test for the new object storage clone path will be
added in the following commit. The preexisting local-filesystem
clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable
test.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
8fa82e6b6f storage: make delete_object const in object_storage_base
The method doesn't modify any member state. Making it
const is needed for calling it from the const clone
method.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
47387341bb storage: add make_object_name overload with generation
Add a make_object_name overload that accepts a target
generation parameter for constructing object names with
a generation different from the source sstable's own.

Refactor the original make_object_name to delegate to
the new overload, eliminating code duplication.

This is needed by clone to build destination object
names for the new generation.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
8bd891c6ed sstables: add get_format() accessor to sstable
Add a public get_format() accessor for the _format member, following
the same pattern as the existing get_version(). This allows storage
implementations to access the sstable format without reaching into
private members, and is needed by the upcoming object_storage_base::clone
to construct entry_descriptor for the sstables registry.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
3d23490615 object_storage: add copy_object to object_storage_client
Add a copy_object method to the object_storage_client
interface for server-side object copies, with
implementations for both S3 and GCS wrappers.

The S3 wrapper delegates to s3::client::copy_object.
The GCS wrapper delegates to gcp::storage::client's
cross-bucket copy_object overload.

This is a prerequisite for implementing sstable clone
on object storage.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
1702d6e6d4 s3_client: pass through abort_source in copy_object
The abort_source parameter in s3::client::copy_object
was ignored — the function accepted it but always passed
nullptr to the underlying copy_s3_object. Forward it
properly so callers can cancel in-progress copies.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
bfdc1e5267 gcp_client: fix copy_object request method and body
The GCP copy_object (rewrite API) had two bugs:

1. The request body was an empty string, but the GCP
   rewrite endpoint always parses it as JSON metadata.
   An empty string is not valid JSON, resulting in
   400 "Metadata in the request couldn't decode".
   Fix: send "{}" (empty JSON object) as the body.

2. The HTTP method was PUT, but the GCP Objects: rewrite
   API requires POST per the documentation.
   Fix: use POST.

Test coverage in a follow-up patch
2026-04-07 18:16:52 +03:00
Nadav Har'El
a0e79f391f Merge 'alternator: fix batch write item squashing cdc entries' from Radosław Cybulski
When `BatchWriteItem` operates on multiple items sharing the same partition key in `always_use_lwt` write isolation mode, all CDC log entries are emitted under a single timestamp. The previous `get_records` parsing algorithm in `alternator/streams.cc` assumed that all CDC log entries sharing the same timestamp correspond to a single DynamoDB item change. As a result, it would incorrectly squash multiple distinct item changes into a single Streams record — producing wrong event data (e.g., one INSERT instead of four, with mismatched key/attribute values).

Note: the bug is specific to `always_use_lwt` mode because only in LWT mode does the entire batch share a single timestamp. In non-LWT modes, each item in the batch receives a separate timestamp, so the entries naturally stay separate.

**Commit 1: alternator: add BatchWriteItem Streams test**

- Adds new tests `test_streams_batchwrite_no_clustering_deletes_non_existing_items` and `test_streams_batchwrite_no_clustering_deletes_existing_items` that cover the corner cases of batch-deleting a existing and non-existing item in a table without a clustering key. CDC tables without clustering keys are handled differently, and this path was previously untested for delete operations.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data`, that is a simple way to trigger a bug.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_deletes_existing_items`, that validates various combinations of puts and deletes in a single BatchWrite against the same partition.
- Adds a new `test_table_ss_new_and_old_images_write_isolation_always` fixture and extends `create_table_ss` to accept `additional_tags`, enabling tests with a specific write isolation mode.

**Commit 2: alternator: fix BatchWriteItem squashed Streams entries**

The core fix rewrites the CDC log entry parsing in `get_records` to distinguish items by their clustering key:

- Introduces `managed_bytes_ptr_hash` and `managed_bytes_ptr_equal` helper structs for pointer-based hash map lookups on `managed_bytes`.
- Replaces the single `record`/`dynamodb` pair with a `std::unordered_map<const managed_bytes*, Record, ...>` (`records_map`) keyed by the base table's clustering key value from each CDC log row. For tables without a clustering key, all entries map to a single sentinel key.
- Adds a validation that Alternator tables have at most one clustering key column (as required by the DynamoDB data model).
- On end-of-record (`eor`), flushes all accumulated per-clustering-key records into the output, each with a unique `eventID` (the `event_id` format now includes an index suffix).
- Adjusts the limit check: since a single CDC timestamp bucket can now produce multiple output records, the limit may be slightly exceeded to avoid breaking mid-batch.

Fixes #28439
Fixes: SCYLLADB-540

Closes scylladb/scylladb#28452

* github.com:scylladb/scylladb:
  alternator/test: explain why 'always' write isolation mode is used in tests
  alternator/test: add scylla_only to always write isolation fixture
  alternator: fix BatchWriteItem squashed Streams entries
  alternator: add BatchWriteItem test (failing)
2026-04-07 17:49:23 +03:00
Nadav Har'El
22e7ef46a7 Merge 'vector_search: fix SELECT on local vector index' from Karol Nowacki
Queries against local vector indexes were failing with the error:
```ANN ordering by vector requires the column to be indexed using 'vector_index'```

This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.

Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have  different target format.

This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.

This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.

Fixes: SCYLLADB-895

Backport to 2026.1 is required as this issue occurs also on this branch.

Closes scylladb/scylladb#28862

* github.com:scylladb/scylladb:
  index: fix DESC INDEX for vector index
  vector_search: test: refactor boilerplate setup
  vector_search: fix SELECT on local vector index
  index: test: vector index target option serialization test
  index: test: secondary index target option serialization test
2026-04-07 17:43:35 +03:00
Michał Jadwiszczak
9cf94116c2 db/view/view_building_worker: fix indentation 2026-04-07 16:12:04 +02:00
Michał Jadwiszczak
c9aa5bb09c db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks
To create `process_staging` view building tasks, we firstly need to
collect informations about them on shard0, create necessary mutations,
commit them to group0 and move staging sstables objects to their
original shards.

But there is a possible race after committing the group0 command
and before moving the staging sstables to their shards.
Between those two events, the coordinator may schedule freshly created
tasks and dispatch them to the worker but the worker won't have the
sstables objects because they weren't moved yet.

This patch fixes the race by holding `_staging_sstables_mutex` locks
from necessary shards when executing `create_staging_sstable_tasks()`.
With this, even if the task will be scheduled and dispatched quickly,
the worker will wait with executing it until the sstables objects are
moved and the locks are released.

Fixes SCYLLADB-816
2026-04-07 16:11:45 +02:00
Pavel Emelyanov
58e59e8c0d Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.

The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.

* No functional change and no backport needed

Closes scylladb/scylladb#29209

* github.com:scylladb/scylladb:
  test: add test_sstable_clone_preserves_staging_state
  test: derive sstable state from directory in test_env::make_sstable
  sstables: log debug message in filesystem_storage::clone
2026-04-07 17:02:04 +03:00
Botond Dénes
816f2bf163 Merge 'cql3: fix null handling in data_value formatting' from Dario Mirovic
`data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead.

Added tests after the fix. Manually checked that tests fail without the fix.

Fixes SCYLLADB-1350

This is a fix that prevents format crash. No known occurrence in production, but backport is desirable.

Closes scylladb/scylladb#29262

* github.com:scylladb/scylladb:
  test: boost: test null data value to_parsable_string
  cql3: fix null handling in data_value formatting
2026-04-07 16:35:31 +03:00
Dimitrios Symonidis
701808d7aa test/object_store: parametrize test_basic over replication factor
Extend test_basic to run with both RF=1 and RF=3 to verify that
object storage works correctly with multiple replicas. The test now
starts one server per replica (each on its own rack), flushes all
nodes, validates tablet replica counts for RF>1, and restarts all
servers before verifying data is still readable.

Fixes: SCYLLADB-546

Closes scylladb/scylladb#28583
2026-04-07 16:27:44 +03:00
Nadav Har'El
f642db0693 test/alternator: tests for missing support of ReturnConsumedCapacity
As noted in issue #5027 and issue #29138, Alternator's support for
ReturnConsumedCapacity is lacking in a two areas:

1. While ReturnConsumedCapacity is supported for most relevant
   operations, it's not supported in two operations: Query and Scan.

2. While ReturnConsumedCapacity=TOTAL is supported, INDEXES is not
   supported at all.

This patch adds extensive tests for all these cases. All these tests
pass on DynamoDB but fail on Alternator, so are marked with "xfail".

The tests for ReturnConsumedCapacity=INDEXES are deliberately split
into two: First, we test the case where the table has no indexes, so
INDEXES is almost the same as TOTAL and should be very easy to
implement. A second test checks the cases where there are indexes,
and different operations increment the capacity of the base table
and/or indexes differently - it will require significantly more work
to make the second test pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29188
2026-04-07 16:07:41 +03:00
Nadav Har'El
f590ee2b7e cdc, vector: fix CDC result tracker for vector indexes
When a table has a vector index, cdc::cdc_enabled() returns true because
vector index writes are implemented via the CDC augmentation path. However,
register_cdc_operation_result_tracker() was checking only
cdc_options().enabled(), which is false for tables that have a vector index
but not traditional CDC.

As a result, the operation_result_tracker was never attached to write
response handlers for vector-indexed tables. This tracker was added in
commit 1b92cbe, and its job is to update metrics of CDC operations,
and since vector search really does use CDC under the hood, these
metrics could be useful when diagnosing problems.

Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which
covers both traditional CDC and vector-indexed tables.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29343
2026-04-07 15:54:51 +03:00
Avi Kivity
8c629d55b0 test: vector_search: check [[nodiscard]] return values of expected<> types
Clang 22 verifies [[nodiscard]] for co_await,
causing compilation failures where return values of expected<> were
silently discarded.

These call sites were discarding the return value of client::request()
and vector_store_client::ann(), both of which return expected<> types
marked [[nodiscard]]. Rather than suppressing the warning with (void)
casts, properly check the return values using the established test
patterns: BOOST_CHECK(result) where the call is expected to succeed,
and BOOST_CHECK(!result) where the call is expected to fail.

Closes scylladb/scylladb#29297
2026-04-07 15:25:08 +03:00
Anna Stuchlik
176f6fb59e doc: add the 2026.x patch release upgrade guide-from-2025
This issue adds the upgrade guide for all patch releases within 2026.x major release.

In addition, it fixes the link to Upgrade Policy in the 2025.x-to-2026.1 upgrade guide.

Fixes SCYLLADB-1247

Closes scylladb/scylladb#29307
2026-04-07 13:52:16 +02:00
Anna Stuchlik
d329c91f9e doc: remove About Upgrade and redirect to Upgrade Policy
While fixing https://github.com/scylladb/scylladb/issues/28997, we added a new page about upgrade policy:
https://docs.scylladb.com/stable/versioning/upgrade-policy.html

This commit removes the old page and adds redirections to the new Upgrade Policy page
in the unversioned documentation set.

Closes scylladb/scylladb#29251
2026-04-07 13:44:10 +02:00
Andrei Chekun
93583bf193 test.py: use safe_drive_shutdown in the tests
These methods for closing driver was missed during original fix.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#29093
2026-04-07 14:35:18 +03:00
Avi Kivity
00409b61f1 Merge 'Add Vnodes to Tablets Migration Procedure' from Nikos Dragazis
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.

The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
    1. Mark a node for upgrade in the topology state.
    2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
    3. Wait for the node to return online before proceeding to the next node.
4. Finalize the migration:
    1. Update the keyspace schema to mark it as tablet-based.
    2. Clear the group0 state related to the migration.

From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.

During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.

The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.

Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.

New feature, no backport is needed.

Closes scylladb/scylladb#29065

* github.com:scylladb/scylladb:
  docs: Add ops guide for vnodes-to-tablets migration
  test: cluster: Add test for migration of multiple keyspaces
  test: cluster: Add test for error conditions
  test: cluster: Add vnodes->tablets migration test (rollback)
  test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
  test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
  scylla-nodetool: Add migrate-to-tablets subcommand
  api: Add REST endpoint for vnode-to-tablet migration status
  api: Add REST endpoint for migration finalization
  topology_coordinator: Add `finalize_migration` request
  database: Construct migrating tables with tablet ERMs
  api: Add REST endpoint for upgrading nodes to tablets
  api: Add REST endpoint for starting vnodes-to-tablets migration
  topology_state_machine: Add intended_storage_mode to system.topology
  distributed_loader: Wire vnode-based resharding into table populator
  replica: Pick any compaction group for resharding
  compaction: resharding_compaction: add vnodes_resharding option
  storage_service: Preserve ERM flavor of migrating tables
  tablet_allocator: Exclude migrating tables from load balancing
  feature_service: Add vnodes_to_tablets_migrations feature
2026-04-07 14:32:22 +03:00
Łukasz Paszkowski
6f364fd3b7 db: fix system.size_estimates to aggregate sstable estimates across all shards
The estimate() function in the size_estimates virtual reader only
considered sstables local to the shard that happened to own the
keyspace's partition key token. Since sstables are distributed across
shards, this caused partition count estimates to be approximately
1/smp_count of the actual value.

This bug has been present since the virtual reader was introduced in
225648780d.

Use db.container().map_reduce0() to aggregate sstable estimates
across all shards. Each shard contributes its local count and
estimated_histogram, which are then merged to produce the correct
total.

Also fix the `test_partitions_estimate_full_overlap` test which becomes
flaky (xpassing ~1% of runs) because autocompaction could merge the
two overlapping sstables before the size estimate was read. Wrap the
test body in nodetool.no_autocompaction_context to prevent this race.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179
Refs https://github.com/scylladb/scylladb/issues/9083

Closes scylladb/scylladb#29286
2026-04-07 14:13:26 +03:00
Piotr Smaron
7d449a307c docs: remove old audit design doc
As discussed with @ScyllaPiotr in
https://github.com/scylladb/scylladb/pull/29232, the doc about to be
removed is just:
> Looking at history, I think this audit.md is a design doc: scylladb/scylla-enterprise@87a5c19, for which the feature has been implemented differently, eventually, and was created around the time when design docs, apparently, where stored within the repository itself. So for me it's some trash (sorry for strong language) that can be safely removed.

Closes scylladb/scylladb#29316
2026-04-07 14:11:53 +03:00
Avi Kivity
8b4a91982b cmake: add missing rolling_max_tracker_test and symmetric_key_test
Added in 5b2a07b408 and c596ae6eb1 without cmake integration.

Closes scylladb/scylladb#29328
2026-04-07 14:09:00 +03:00
Avi Kivity
d01c9a425f test: test_out_of_storage_prevention: fix invalid escape in regex
Python warns that the sequence "\(" is an invalid escape and
might be rejected in the future. Protect against that by using
a raw string.

Closes scylladb/scylladb#29334
2026-04-07 14:06:32 +03:00
Pavel Emelyanov
0ae781c008 Merge 'test: auth_test: coroutinize' from Avi Kivity
Convert auth_test.cc to coroutines for improved readability. Each test is converted in its own commit. Some
are trivial.

Indentation is left broken in some commits to reduce the diff, then fixed up in the last commit.

Code cleanup, so no backport.

Closes scylladb/scylladb#29336

* github.com:scylladb/scylladb:
  auth_test: fix whitespace
  auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
  auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
  auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
  auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
  auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
  auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
  auth_test: coroutinize test_alter_with_workload_type
  auth_test: coroutinize test_alter_with_timeouts
  auth_test: coroutinize role_permissions_table_is_protected
  auth_test: coroutinize role_members_table_is_protected
  auth_test: coroutinize roles_table_is_protected
  auth_test: coroutinize test_password_authenticator_operations
  auth_test: coroutinize test_password_authenticator_attributes
  auth_test: coroutinize test_default_authenticator
2026-04-07 14:05:32 +03:00
Botond Dénes
513af59130 encryption: improve error message when KMS host is not configured
When an SSTable was encrypted with a KMS host that is not present in
scylla.yaml, the error thrown was:

  std::invalid_argument (No such host: <host-name>)

This message is very obscure in general, and especially confusing when
encountered while using the scylla-sstable tool: it gives no indication
that the SSTable is encrypted, that a KMS host lookup is involved, or
what the user needs to do to fix the problem.

Replace it with a message that names the missing host and points
directly to the relevant scylla.yaml section:

  Encryption host "<host-name>" is not defined in scylla.yaml.
  Make sure it is listed under the "kmip_hosts" section.

The wording is intentionally kept neutral (not framed as an SSTable tool
problem) because the same code path is exercised by production ScyllaDB
when a node's configuration no longer contains a host referenced by an
existing data file (e.g. after a config rollback or when restoring data
from a different cluster). The production use-case takes precedence, but
the message is equally actionable from the tool.

Closes scylladb/scylladb#29228
2026-04-07 14:00:27 +03:00
Botond Dénes
7344c05494 scylla-gdb.py: fix small_vector.__len__()
start - end will result in negative length, rejected by the python
runtime. Use the correct end - start to calculate length.

Closes scylladb/scylladb#29249
2026-04-07 13:57:21 +03:00
Botond Dénes
f71d2e78d8 tombstone_gc: don't use real-db for validation and determining default
data_dictionary::database was converted to replica::database in two
places, just to call find_keyspace(), then call
get_replication_strategy() on the returned keyspace. This is not
necessary, data_dictionary::database already has find_keyspace() and the
returned data_dictionary::keyspace also has get_replication_strategy().

This patch removes a small layering violation but more importantly, it
is necessary for the sstable tool to be able to load schemas from disk,
when said schema has tombstone_gc props.

Closes scylladb/scylladb#29279
2026-04-07 13:56:24 +03:00
Pavel Emelyanov
d6df5ef60a Merge 'compaction_test: Make compaction tests backend‑agnostic and add S3/GCS support' from Ernest Zaslavsky
This series updates the storage abstraction and extends the compaction tests to support object‑storage backends (S3 and GCS), while tightening several parts of the test environment.

The changes include:

- New exists/object_exists helpers across storage backends and clock fixes in the S3 client to make signature generation stable under test conditions.

- A new get_storage_for_tests accessor and adjustments to the test environment to avoid premature teardown of the sstable registry.

- Refactoring of compaction tests to remove direct sstable access, ensure proper schema setup, and avoid use of moved‑from objects.

- Extraction of test_env‑based logic into reusable functions and addition of S3/GCS variants of the compaction tests.

Not all tests were converted to be backend‑agnostic yet, and a few require further investigation before they can run cleanly against S3/GCS backends. These will be addressed in follow‑up work.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-704 however, followup is needed

No backport needed since this change targeting future feature

Closes scylladb/scylladb#28790

* github.com:scylladb/scylladb:
  compaction_test: fix formatting after previous patches
  compaction_test: add S3/GCS variations to tests
  compaction_test: extract test_env-based tests into functions
  compaction_test: replace file_exists with storage::exists
  compaction_test: initialize tables with schema via make_table_for_tests
  compaction_test: use sstable APIs to manipulate component files
  compaction_test: fix use-after-move issue
  sstable_utils: add `get_storage` and `open_file` helpers
  test_env: delay unplugging sstable registry
  storage: add `exists` method to storage abstraction
  s3_client: use lowres_system_clock for aws_sigv4
  s3_client: add `object_exists` helper
  gcs_client: add `object_exists` helper
2026-04-07 13:53:48 +03:00
Piotr Dulikowski
4161273b4c Merge 'view_building_worker: fix race during draining procedure' from Michał Jadwiszczak
View building worker was breaking semaphores without holding their locks.
This lead to races like SCYLLADB-844 and SCYLLADB-543,
where a new batch was started after `view_building_worker::state` was cleared in the `drain()` process.

This patch fix the race by:
- taking a lock of the mutex before breaking it
- distinguishing between `state::clear()`(can happen multiple times) and `state::drain()`(can be called only once during shutdown)
- asserting that the state is not doing any new work after it was drained

Fixes SCYLLADB-844
Fixes SCYLLADB-543

This PR should be backported to all versions containing view building coordinator (2025.4 and newer).

Closes scylladb/scylladb#29303

* github.com:scylladb/scylladb:
  view_building_worker: extract starting a new batch to state's method
  view_building_worker: distinguish between state's `clear()` and `drain()`
  view_building_worker: lock mutexes before breaking them in `drain()`
  view_building_worker: execute drain() once
2026-04-07 12:13:51 +02:00
Avi Kivity
bc10e1a171 test: fix flaky test_login by not retrying authentication failures
The fix for SCYLLADB-1373 (b4f652b7c1) changed get_session() to use
the default timeout=30 for the retry loop in patient_*_cql_connection
(previously timeout=0.1). This correctly allowed retrying transient
NoHostAvailable errors during node startup, but introduced a new
flakiness in test_login and other auth tests.

The failure chain:

1. test_login connects with bad credentials (e.g. user="doesntexist")
2. get_session() calls patient_exclusive_cql_connection(), which calls
   retry_till_success() with bypassed_exception=NoHostAvailable
3. The first attempt correctly fails: the server rejects the credentials
   with AuthenticationFailed, wrapped in NoHostAvailable
4. retry_till_success() catches NoHostAvailable indiscriminately and
   retries, not distinguishing between transient errors (node not ready)
   and permanent errors (bad credentials)
5. A subsequent retry attempt times out (connect_timeout=5), producing
   OperationTimedOut wrapped in NoHostAvailable
6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping
   OperationTimedOut instead of the original AuthenticationFailed
7. The assertion `isinstance(..., AuthenticationFailed)` fails

With the old timeout=0.1, the deadline was already exceeded after the
first attempt, so the original AuthenticationFailed propagated.

Fix: Add a `should_retry` predicate parameter to retry_till_success()
and use it in patient_cql_connection() and
patient_exclusive_cql_connection() to immediately re-raise
NoHostAvailable when it wraps AuthenticationFailed. Retrying
authentication failures is never useful since the credentials won't
change between attempts.

Fixes: SCYLLADB-1382

Closes scylladb/scylladb#29348
2026-04-07 10:17:31 +03:00
Michał Jadwiszczak
51c164c8d2 view_building_worker: extract starting a new batch to state's method
Following the previous commit, a new batch cannot be started if the
state was already drained.
This commit also adds a check that only one batch is running at a time.
2026-04-07 08:39:05 +02:00
Michał Jadwiszczak
639aa223f3 view_building_worker: distinguish between state's clear() and drain()
While both of this methods do the same (abort current batch, clear
data), we can clear the state multiple times during view_building_worker
lifetime (for instance when processing base table is changed) but
`view_building_worker::state::drain()` should be called only once and
after this no other work on the state should be done.
2026-04-07 08:39:05 +02:00
Michał Jadwiszczak
7aea524f52 view_building_worker: lock mutexes before breaking them in drain()
Not doing this may lead to races like SCYLLADB-844.
If some consumer is holding a lock of a mutex and `drain()`
is just braking the mutex without locking it beforehand,
then the consumer may process its code which should be aborted.

An example of the race is SCYLLADB-844, where `work_on_tasks()` is
holding `_state._mutex` while it is broken by `drain()`.
This causes a new batch is started after the `_state` is cleared.
2026-04-07 08:39:00 +02:00
Michał Jadwiszczak
91c7ac1fb2 view_building_worker: execute drain() once
Future changes will require that the view building worker is drained
only once per its lifetime.
2026-04-07 08:35:02 +02:00
Avi Kivity
b4f652b7c1 test: fix flaky test_create_ks_auth by removing bad retry timeout
get_session() was passing timeout=0.1 to patient_exclusive_cql_connection
and patient_cql_connection, leaving only 0.1 seconds for the retry loop
in retry_till_success(). Since each connection attempt can take up to 5
seconds (connect_timeout=5), the retry loop effectively got only one
attempt with no chance to retry on transient NoHostAvailable errors.

Use the default timeout=30 seconds, consistent with all other callers.

Fixes: SCYLLADB-1373

Closes scylladb/scylladb#29332
2026-04-05 19:13:15 +03:00
Avi Kivity
2f0d178510 auth_test: fix whitespace
Fix over-indented lines inside do_with_mc lambda bodies introduced
during coroutinization.
2026-04-05 18:28:23 +03:00
Avi Kivity
7a24da9e88 auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
e1b52cf337 auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
24d36ad459 auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
6f20129eec auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
cece181113 auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
752391f757 auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
287625b297 auth_test: coroutinize test_alter_with_workload_type
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
4eeb5ef54d auth_test: coroutinize test_alter_with_timeouts
Use co_await instead of return for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
170c71b25d auth_test: coroutinize role_permissions_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
13eccf519f auth_test: coroutinize role_members_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
43ff3798ad auth_test: coroutinize roles_table_is_protected
Use co_await for improved readability.
2026-04-05 18:26:30 +03:00
Avi Kivity
c586eeb003 auth_test: coroutinize test_password_authenticator_operations
Flatten continuation chains (.then()) into linear thread-style code
with .get() calls for improved readability. Remove the now-unused
require_throws helper template.
2026-04-05 18:26:25 +03:00
Avi Kivity
fbccfe5c9d auth_test: coroutinize test_password_authenticator_attributes
Use co_await instead of return+do_with_cql_env+make_ready_future
for improved readability.
2026-04-05 17:28:09 +03:00
Avi Kivity
e3dee64003 auth_test: coroutinize test_default_authenticator
Use co_await instead of return+do_with_cql_env+make_ready_future
for improved readability.
2026-04-05 17:27:45 +03:00
Jenkins Promoter
ab4a2cdde2 Update pgo profiles - aarch64 2026-04-05 16:58:02 +03:00
Jenkins Promoter
b97cf0083c Update pgo profiles - x86_64 2026-04-05 16:00:15 +03:00
Nikos Dragazis
6d50e67bd2 scylla_swap_setup: Remove Before=swap.target dependency from swap unit
When a Scylla node starts, the scylla-image-setup.service invokes the
`scylla_swap_setup` script to provision swap. This script allocates a
swap file and creates a swap systemd unit to delegate control to
systemd. By default, systemd injects a Before=swap.target dependency
into every swap unit, allowing other services to use swap.target to wait
for swap to be enabled.

On Azure, this doesn't work so well because we store the swap file on
the ephemeral disk [1] which has network dependencies (`_netdev` mount
option, configured by cloud-init [2]). This makes the swap.target
indirectly depend on the network, leading to dependency cycles such as:

swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target
-> network.target -> systemd-resolved.service -> tmp.mount -> swap.target

This patch breaks the cycle by removing the swap unit from swap.target
using DefaultDependencies=no. The swap unit will still be activated via
WantedBy=multi-user.target, just not during early boot.

Although this problem is specific to Azure, this patch applies the fix
to all clouds to keep the code simple.

Fixes #26519.
Fixes SCYLLADB-1257

[1] https://github.com/scylladb/scylla-machine-image/pull/426
[2] https://github.com/canonical/cloud-init/pull/1213#issuecomment-1026065501

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28504
2026-04-05 15:07:50 +03:00
Tomasz Grabiec
74542be5aa test: pylib: Ignore exceptions in wait_for()
ManagerClient::get_ready_cql() calls server_sees_others(), which waits
for servers to see each other as alive in gossip. If one of the
servers is still early in boot, RESTful API call to
"gossiper/endpoint/live" may fail. It throws an exception, which
currently terminates the wait_for() and propagates up, failing the test.

Fix this by ignoring errors when polling inside wait_for. In case of
timeout, we log the last exception. This should fix the problem not
only in this case, for all uses of wait_for().

Example output:

```
pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
                    raise AssertionError(timeout_msg) from last_exception
                raise AssertionError(timeout_msg)

            try:
>               res = await pred()

test/pylib/util.py:80:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    async def _sees_min_others():
>       raise Exception("asd")
E       Exception: asd

test/pylib/manager_client.py:802: Exception

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7f022af7e7b0>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
>       cql, _ = await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:33:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:820: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:806: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
>                   raise AssertionError(timeout_msg) from last_exception
E                   AssertionError: wait_for(_sees_min_others) timed out after 45.30s (46 retries); last exception: Exception: asd

test/pylib/util.py:76: AssertionError
```

Fixes a failure observed in test_auth_after_reset:

```
manager = <test.pylib.manager_client.ManagerClient object at 0x7fb3740e1630>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
        cql, _ = await manager.get_ready_cql(servers)
        await cql.run_async("ALTER ROLE cassandra WITH PASSWORD = 'forgotten_pwd'")

        logging.info("Stopping cluster")
        await asyncio.gather(*[manager.server_stop_gracefully(server.server_id) for server in servers])

        logging.info("Deleting sstables")
        for table in ["roles", "role_members", "role_attributes", "role_permissions"]:
            await asyncio.gather(*[manager.server_wipe_sstables(server.server_id, "system", table) for server in servers])

        logging.info("Starting cluster")
        # Don't try connect to the servers yet, with deleted superuser it will be possible only after
        # quorum is reached.
        await asyncio.gather(*[manager.server_start(server.server_id, connect_driver=False) for server in servers])

        logging.info("Waiting for CQL connection")
        await repeat_until_success(lambda: manager.driver_connect(auth_provider=PlainTextAuthProvider(username="cassandra", password="cassandra")))
>       await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:819: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:805: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
test/pylib/util.py:71: in wait_for
    res = await pred()
test/pylib/manager_client.py:802: in _sees_min_others
    alive_nodes = await self.api.get_alive_endpoints(server_ip)
test/pylib/rest_client.py:243: in get_alive_endpoints
    data = await self.client.get_json(f"/gossiper/endpoint/live", host=node_ip)
test/pylib/rest_client.py:99: in get_json
    ret = await self._fetch("GET", resource_uri, response_type = "json", host = host,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test.pylib.rest_client.TCPRESTClient object at 0x7fb2404a0650>
method = 'GET', resource = '/gossiper/endpoint/live', response_type = 'json'
host = '127.15.252.8', port = 10000, params = None, json = None, timeout = None
allow_failed = False

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None, allow_failed: bool = False) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")

        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if allow_failed:
                return await resp.json()
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 404, uri: http://127.15.252.8:10000/gossiper/endpoint/live, params: None, json: None, body:
E               {"message": "Not found", "code": 404}

test/pylib/rest_client.py:77: HTTPError
```

Fixes: SCYLLADB-1367

Closes scylladb/scylladb#29323
2026-04-05 13:52:26 +03:00
Ernest Zaslavsky
c7a74237b3 compaction_test: fix formatting after previous patches 2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
101b4ad7fa compaction_test: add S3/GCS variations to tests
Add S3 and GCS variants of the compaction tests to expand coverage for
keyspaces configured to use object_storage backends.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
03bd3010bf compaction_test: extract test_env-based tests into functions
Move all test code that relies on test_env into standalone free
functions so they can be reused by upcoming S3 and GCS test suites.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
b18528e97e compaction_test: replace file_exists with storage::exists
Replace direct filesystem checks (file_exists) with the
storage-agnostic exists() method in unsealed_sstable_compaction,
sstable_clone_leaving_unsealed_dest_sstable, and
failure_when_adding_new_sstable tests, making them compatible
with object-storage backends (S3, GCS).
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
98492e4ea8 compaction_test: initialize tables with schema via make_table_for_tests
Start using `table_for_tests::make_default_schema` so test tables are
created with a real schema. This is required for object-storage
backends, which cannot operate correctly without proper schema
initialization.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
5ba79e2ed4 compaction_test: use sstable APIs to manipulate component files
Switch tests to use sstable member functions for file manipulation
instead of opening files directly on the filesystem. This affects the
helpers that emulate sstable corruption: we now overwrite the entire
component file rather than just the first few kilobytes, which is
sufficient for producing a corrupted sstable.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
405c032f48 compaction_test: fix use-after-move issue
We were moving `compaction_type_options` inside a loop, so on the
second iteration the test received an already moved-from instance.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
437a581b04 sstable_utils: add get_storage and open_file helpers
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
2ad2dbae03 test_env: delay unplugging sstable registry
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
8f6630e9cd storage: add exists method to storage abstraction
Add an `exists` method to the storage abstraction to allow S3, GCS,
and local storage implementations to check whether an sstable
component is present.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
ba785f6cab s3_client: use lowres_system_clock for aws_sigv4
Switch aws_sigv4 to lowres_system_clock since it is not affected by
time offsets often introduced in tests, which can skew db_clock. S3
requests cannot represent time shifts greater than 15 minutes from
server time, so a stable clock is required.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
e08d779922 s3_client: add object_exists helper
Introduce `object_exists` to the S3 client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Ernest Zaslavsky
016b344a8a gcs_client: add object_exists helper
Introduce `object_exists` to the GCS client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Andrzej Jackowski
8c0920202b test: protect populate_range in row_cache_test from bad_alloc
When test_exception_safety_of_update_from_memtable was converted from
manual fail_after()/catch to with_allocation_failures() in 74db08165d,
the populate_range() call ended up inside the failure injection scope
without a scoped_critical_alloc_section guard. The other two tests
converted in the same commit (test_exception_safety_of_transitioning...
and test_exception_safety_of_partition_scan) were correctly guarded.

Without the guard, the allocation failure injector can sometimes
target an allocation point inside the cleanup path of populate_range().
In a rare corner case, this triggers a bad_alloc in a noexcept context
(reader_concurrency_semaphore::stop()), causing std::terminate.

Fixes SCYLLADB-1346

Closes scylladb/scylladb#29321
2026-04-04 21:13:26 +03:00
Andrzej Jackowski
ec274cf7b6 test: add test_upgrade_preserves_ddl_audit_for_tables
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations (SCYLLADB-1155).

Test time in dev: 4s

Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
2026-04-03 13:53:28 +02:00
Andrzej Jackowski
9c7b7ac3e3 test: audit: split validate helper so callers need not pass audit_settings
The old execute_and_validate_audit_entry required every caller to
pass audit_settings so it could decide internally whether to expect
an entry. A test added later in this series needs to simply assert
an entry was produced, without specifying audit_settings at all.

Split into two methods:
- execute_and_validate_new_audit_entry: unconditionally expects an
  audit entry.
- execute_and_validate_if_category_enabled: checks audit_settings
  to decide whether to expect an entry or assert absence.

Local wrapper functions and **kwargs forwarding are removed in favor
of explicit arguments at each call site, and expected-error cases are
handled inline with assert_invalid + assert_entries_were_added.
2026-04-03 13:52:47 +02:00
Andrzej Jackowski
189bff1d5c test: audit: declare manager attribute in AuditTester base class
AuditTester uses self.manager throughout but never declares it.
The attribute is only assigned in the CQLAuditTester subclass
__init__, so the type checker reports 'Attribute "manager" is
unknown' on every self.manager reference in the base class.

Add an __init__ to AuditTester that accepts and stores the manager
instance, and update CQLAuditTester to forward it via super().__init__
instead of assigning self.manager directly.
2026-04-03 13:52:47 +02:00
Botond Dénes
2c22d69793 Merge 'Pytest: fix variable handling in GSServer (mock) and ensure docker service logs go to test log as well' from Calle Wilund
Fixes: SCYLLADB-1106

* Small fix in scylla_cluster - remove debug print
* Fix GSServer::unpublish so it does not except if publish was not called beforehand
* Improve dockerized_server so mock server logs echo to the test log to help diagnose CI failures (because we don't collect log files from mocks etc, and in any case correlation will be much easier).

No backport needed.

Closes scylladb/scylladb#29112

* github.com:scylladb/scylladb:
  dockerized_service: Convert log reader to pipes and push to test log
  test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
  scylla_cluster: Use thread safe future signalling
  scylla_cluster: Remove left-over debug printout
2026-04-03 06:38:05 +03:00
Raphael S. Carvalho
b6ebbbf036 test/cluster/test_tablets2: Fix test_split_stopped_on_shutdown race with stale log messages
The test was failing because the call to:

    await log.wait_for('Stopping.*ongoing compactions')

was missing the 'from_mark=log_mark' argument. The log mark was updated
(line: log_mark = await log.mark()) immediately after detecting
'splitting_mutation_writer_switch_wait: waiting', and just before
launching the shutdown task. However, the wait_for call on the following
line was scanning from the beginning of the log, not from that mark.

As a result, the search immediately matched old 'Stopping N tasks for N
ongoing compactions for table system.X due to table removal' messages
emitted during initial server bootstrap (for system.large_partitions,
system.large_rows, system.large_cells), rather than waiting for the
shutdown to actually stop the user-table split compaction.

This caused the test to prematurely send the message to the
'splitting_mutation_writer_switch_wait' injection. The split compaction
was unblocked before the shutdown had aborted it, so it completed
successfully. Since the split succeeded, 'Failed to complete splitting
of table' was never logged.

Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting
for a message. With the split already done, the test was stuck waiting
for the expected failure log that would never come (600s timeout). At
the same time, after 60s the 'storage_service_drain_wait' injection
timed out internally, triggering on_internal_error() which -- with
--abort-on-internal-error=1 -- crashed the server (exit code -6).

Fix: pass from_mark=log_mark to the wait_for('Stopping.*ongoing
compactions') call so it only matches messages that appear after the
shutdown has started, ensuring the test correctly synchronizes with the
shutdown aborting the user-table split compaction before releasing the
injection.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29311
2026-04-03 06:28:51 +03:00
Andrei Chekun
6526a78334 test.py: fix nodetool mock server port collision
Replace the random port selection with an OS-assigned port. We open
a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back
the port number the OS selected, then close the socket before launching
rest_api_mock.py.
Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py
so the server itself can also reclaim a TIME_WAIT port if needed.

Fixes: SCYLLADB-1275

Closes scylladb/scylladb#29314
2026-04-02 16:24:07 +02:00
Botond Dénes
eb78498e07 test: fix flaky test_timeout_is_applied_on_lookup by using eventually_true
On slow/overloaded CI machines the lowres_clock timer may not have
fired after the fixed 2x sleep, causing the assertion on
get_abort_exception() to fail. Replace the fixed sleep with
sleep(1x) + eventually_true() which retries with exponential backoff,
matching the pattern already used in test_time_based_cache_eviction.

Fixes: SCYLLADB-1311

Closes scylladb/scylladb#29299
2026-04-01 18:20:11 +03:00
Marcin Maliszkiewicz
a74665b300 transport: add per-service-level pending response memory metric
Track the total memory consumed by responses waiting to be
written to the socket, exposed as a per-scheduling-group gauge
(cql_pending_response_memory). This complements the response
memory accounting added in the previous commits by giving
visibility into how much memory each service level is holding
in unsent response buffers.
2026-04-01 17:15:28 +02:00
Robert Bindar
e7527392c4 test: close clients if cluster teardown throws
make sure the driver is stopped even though cluster
teardown throws and avoid potential stale driver
connections entering infinite reconnect loops which
exhaust cpu resources.

Fixes: SCYLLADB-1189

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#29230
2026-04-01 17:22:19 +03:00
Tomasz Grabiec
2ec47a8a21 tests: address_map_test: Fix flakiness in debug mode due to task reordering
Debug mode shuffles task position in the queue. So the following is possible:
 1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there
 2) the smp::submit_to(0, ...) from shard 1 called by the test sumbits the call
 3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor
 4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..)
 5) shard 0 inserts the completion from (4) before task from (1)
 6) the check on shard 0: m.find(id1) fails because the timer is not expired yet

To fix that, wait for timer expiration on shard 0, so that the test
doesn't depend on task execution order.

Note: I was not able to reproduce the problem locally using test.py --mode
debug --repeat 1000.

It happens in jenkins very rarely. Which is expected as the scenario which
leads to this is quite unlikely.

Fixes SCYLLADB-1265

Closes scylladb/scylladb#29290
2026-04-01 17:17:35 +03:00
Aleksandra Martyniuk
4d4ce074bb test: node_ops_tasks_tree: reconnect driver after topology changes
The test exercises all five node operations (bootstrap, replace, rebuild,
removenode, decommission) and by the end only one node out of four
remains alive. The CQL driver session, however, still holds stale
references to the dead hosts in its connection pool and load-balancing
policy state.

When the new_test_keyspace context manager exits and attempts
DROP KEYSPACE, the driver routes the query to the dead hosts first,
gets ConnectionShutdown from each, and throws NoHostAvailable before
ever trying the single live node.

Fix by calling driver_connect() after the decommission step, which
closes the old session and creates a fresh one connected only to the
servers the test manager reports as running.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1313.

Closes scylladb/scylladb#29306
2026-04-01 17:13:11 +03:00
Dario Mirovic
85127fded8 test: boost: test null data value to_parsable_string
Add tests for null value in data_type::to_parsable_string().
We now explicitly return "null".

Refs SCYLLADB-1350
2026-04-01 14:15:25 +02:00
Dario Mirovic
fc705dfb4b cql3: fix null handling in data_value formatting
data_value::to_parsable_string() crashed with a null pointer
dereference when called on a null data_value. Return "null" instead.

Fixes SCYLLADB-1350
2026-04-01 14:15:18 +02:00
Andrzej Jackowski
cccb014747 test: ldap: add regression test for double-free on unregistered message ID
Sends a search via the raw LDAP handle (bypassing _msgid_to_promise
registration), then triggers poll_results() through the public API
to exercise the unregistered-ID branch.

Refs: SCYLLADB-1344
2026-04-01 12:57:50 +02:00
Botond Dénes
0351756b15 Merge 'test: fix fuzzy_test timeout in release mode' from Piotr Smaron
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.

In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each.  With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.

Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200.  This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.

Fixes: SCYLLADB-1270

No need to backport for now, only appeared on master.

Closes scylladb/scylladb#29293

* github.com:scylladb/scylladb:
  test: clean up fuzzy_test_config and add comments
  test: fix fuzzy_test timeout in release mode
2026-04-01 11:50:15 +03:00
Andrzej Jackowski
f0028c06dc ldap: fix double-free of LDAPMessage in poll_results()
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.

Fixes: SCYLLADB-1344
2026-04-01 10:35:13 +02:00
Andrei Chekun
18f41dcd71 test.py: introduce new scheduler for choosing job count
This commit improves how test.py chohoses the default number of
parallele jobs.
This update keeps logic of selecting number of jobs from memory and cpu limits
but simplifies the heuristic so it is smoother, easier to reason about.
This avoids discontinuities such as neighboring machine sizes producing
unexpectedly different job counts, and behaves more predictably on asymmetric
machines where CPU and RAM do not scale together.

Compared to the current threshold-based version, this approach:
- avoids hard jumps around memory cutoffs
- avoids bucketed debug scaling based on CPU count
- keeps CPU and memory as separate constraints and combines them in one place
- avoids double-penalizing debug mode
- is easier to tune later by adjusting a few constants instead of rewriting branching logic

Closes scylladb/scylladb#28904
2026-04-01 11:11:15 +03:00
Avi Kivity
d438e35cdd test/cluster: fix race in test_insert_failure_standalone audit log query
get_audit_partitions_for_operation() returns None when no audit log
rows are found. In _test_insert_failure_doesnt_report_success_assign_nodes,
this None is passed to set(), causing TypeError: 'NoneType' object is
not iterable.

The audit log entry may not yet be visible immediately after executing
the INSERT, so use wait_for() from test.pylib.util with exponential
backoff to poll until the entry appears. Import it as wait_for_async
to avoid shadowing the existing wait_for from test.cluster.dtest.dtest_class,
which has a different signature (timeout vs deadline).

Fixes SCYLLADB-1330

Closes scylladb/scylladb#29289
2026-04-01 10:59:02 +03:00
Michael Litvak
35547bfb6e test: logstor: additional logstor tests 2026-03-31 18:45:08 +02:00
Michael Litvak
5b3e2a4ca2 docs/dev: add logstor on-disk format section 2026-03-31 18:45:08 +02:00
Michael Litvak
39baa573d2 logstor: add version and crc to buffer header
add basic crc and validation to the buffer header. add also a version
field that indicates the version of the on-disk format.
2026-03-31 18:45:08 +02:00
Michael Litvak
6ace823ee4 test: logstor: tablet split/merge and migration
add basic logstor tests for tablet split/merge and migration to verify
it works as expected
2026-03-31 18:45:08 +02:00
Michael Litvak
996d623ab4 logstor: enable tablet balancing
enable tablet balancing with the logstor feature now that it works
2026-03-31 18:45:08 +02:00
Michael Litvak
b02349d755 logstor: streaming of logstor segments using stream_blob
implement tablet migration for logstor tables by streaming segments
using stream_blob, similar to file streaming of sstables.

take a snapshot of the logstor segments and create a stream_blob_info
vector with entry for each segment with the input stream that reads the
segment and an op of type file_ops::stream_logstor_segments.

the stream_blob_handler creates a logstor sink that allocates a segment
on the target shard and creates an output stream that writes to it. when
the sink is closed it loads the segment.
2026-03-31 18:45:08 +02:00
Michael Litvak
78426ae31b logstor: add take_logstor_snapshot
add the function table::take_logstor_snapshot that is similar to
take_storage_snapshot for sstables.

given a token range, for each storage group in the range, it flushes the
separator buffers and then makes a snapshot of all segments in the sg's
compaction groups while disabling compaction.

the segment snapshot holds a reference to the segment so that it won't
be freed by compaction, and it provides an input stream for reading the
segment.

this will be used for tablet migration to stream the segments.
2026-03-31 18:45:08 +02:00
Michael Litvak
754c1b83bd logstor: segment input/output stream
add functions for creating segment input and output streams, that will
be used for segment streaming.

the segment input stream creates a file input stream that reads a given
segment.

the segment output stream allocates a new local segment and creates an
output stream that writes to the segment, and when closed it loads the
segment and adds it to the compaction group.
2026-03-31 18:45:08 +02:00
Michael Litvak
17cab4181b logstor: implement compaction_group::cleanup
implement compaction group cleanup by clearing the range in the index
and discarding the segments of the compaction group.

segments are discarded by overwriting the segment header to indicate the
segment is empty while preserving the segment generation number in order
to not resurrect old data in the segment.
2026-03-31 18:45:08 +02:00
Michael Litvak
9fd6dace72 logstor: tablet split
implement tablet split for logstor.

flush the separator and then perform split as a new type of compaction:
take a batch of segments from the source compaction group, read them and
write all live records into left/right write buffers according to the
split classifier, flush them to the compaction group, and free the old
segments. segments that fit in a single target compaction group are
removed from the source and added to the correct target group.
2026-03-31 18:45:08 +02:00
Michael Litvak
5de39afc24 logstor: tablet merge
implement tablet merge with logstor.

disable compaction for the new compaction group, then merge the merging
compaction groups by merging their logstor segments set into the new cg
- simply merging the segment histogram.
2026-03-31 18:40:57 +02:00
Michael Litvak
684ce8de71 logstor: add compaction reenabler
add a function that stops and disabled compaction for a compaction group
and returns a compaction reenabler object, similarly to the normal
compaction manager.

this will be useful for disabling compaction while doing operations on
the compaction group's logstor segment set.
2026-03-31 18:40:56 +02:00
Michael Litvak
1d7c2e4f52 logstor: add segment header
we have two types of segments. the active segment is "mixed" because we
can write to it multiple write_buffers, each write buffer having records
from different tables and tablets. in constrast, the separator and
compaction write "full" segments - they write a single write_buffer that
has records from a single tablet and storage group.

for "full" segments, we add a segment header the contains additional
useful metadata such as the table and token range in the segment.

the write buffer header contains the type of the buffer, mixed or full.
if it's full then it has a segment header placed after the write buffer
header.
2026-03-31 18:40:56 +02:00
Michael Litvak
8615f68657 logstor: serialize writes to active segment
previously when writing to the active segment, the allocation was
serialized but multiple writes could proceed concurrently to different
offsets. change it instead to serialize the entire write.

we prefer to write larger buffers sequentially instead of multiple
buffers concurrently. it is also better that we don't have "holes" in
the segment.

we also change the buffered_writer to send a single flushing buffer at a
time. it has a ring of buffers, new writes are written to the head
buffer, and a single consumer flushes the tail buffer.
2026-03-31 18:40:56 +02:00
Michael Litvak
e791823874 replica: extend compaction_group functions for logstor
extend compaction_group functions such as disk size calculation and
empty() to account also for the logstor segments that the compaction
group owns.

reuse the sstable_add_gate when there is a write in process to a
compaction group, in order for the compaction group to be considered not
empty.
2026-03-31 18:40:56 +02:00
Michael Litvak
d3db967802 replica: add compaction_group_for_logstor_segment
add the function table::compaction_group_for_logstor_segment that we use
when recovering a segment to find the compaction group for a segment
based on its token range, similarly to compaction_group_for_sstable for
sstables.

extract the common logic from compaction_group_for_sstable to a common
function compaction_group_for_token_range that finds a compaction group
for a token range.
2026-03-31 18:40:56 +02:00
Michael Litvak
bf7bc5b410 logstor: code cleanup
misc code cleanup and small changes
2026-03-31 18:40:56 +02:00
Botond Dénes
2d2ff4fbda sstables: use chunked_managed_vector for promoted indexes in partition_index_page
Switch _promoted_indexes storage in partition_index_page from
managed_vector to chunked_managed_vector to avoid large contiguous
allocations.

Avoid allocation failure (or crashes with --abort-on-internal-error)
when large partitions have enough promoted index entries to trigger a
large allocation with managed_vector.

Fixes: SCYLLADB-1315

Closes scylladb/scylladb#29283
2026-03-31 18:43:57 +03:00
Piotr Smaron
2ce409dca0 test: clean up fuzzy_test_config and add comments
Remove the unused timeout field from fuzzy_test_config.  It was
declared, initialized per build mode, and logged, but never actually
enforced anywhere.

Document the intentionally small max_size (1024 bytes) passed to
read_partitions_with_paged_scan in run_fuzzy_test_scan: it forces
many pages per scan to stress the paging and result-merging logic.
2026-03-31 17:13:26 +02:00
Piotr Smaron
df2924b2a3 test: fix fuzzy_test timeout in release mode
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.

In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each.  With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.

Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200.  This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.

Fixes: SCYLLADB-1270
2026-03-31 17:13:06 +02:00
Piotr Szymaniak
6d8ec8a0c0 alternator: fix flaky test_update_condition_unused_entries_short_circuit
The test was flaky because it stopped dc2_node immediately after an
LWT write, before cross-DC replication could complete. The LWT commit
uses LOCAL_QUORUM, which only guarantees persistence in the
coordinator's DC. Replication to the remote DC is async background
work, and CAS mutations don't store hints. Stopping dc2_node could
drop in-flight RPCs, leaving DC1 without the mutation.

Fix by polling both live DC1 nodes after the write to confirm
cross-DC replication completed before stopping dc2_node. Both nodes
must have the data so that the later ConsistentRead=True
(LOCAL_QUORUM) read on restarted node1 is guaranteed to succeed.

Fixes SCYLLADB-1267

Closes scylladb/scylladb#29287
2026-03-31 16:50:51 +03:00
Dawid Mędrek
f040f1b703 Merge 'raft: remake the read barrier optimization' from Patryk Jędrzejczak
The approach taken in 1ae2ae50a6 turned
out to be incorrect. The Raft member requesting a read barrier could
incorrectly advance its commit_idx and break linearizability. We revert that
commit in this PR.

We also remake the read barrier optimization with a completely new approach.
We make the leader replicate to the non-voting requester of a read barrier if
its `commit_idx` is behind.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-998

No backport: the issue is present only in master.

Closes scylladb/scylladb#29216

* github.com:scylladb/scylladb:
  raft: speed up read barrier requested by non-voters
  Revert "raft: read_barrier: update local commit_idx to read_idx when it's safe"
2026-03-31 15:11:56 +02:00
Marcin Maliszkiewicz
a26ca0f5f7 transport: hold memory permit until response write completes
Capture the memory permit in the leave lambda's .finally()
continuation so that the semaphore units are kept alive until
write_response finishes, preventing premature release of
memory accounting.

This is especially important with slow network and big responses
when buffers can accumulate and deplete node's memory.
2026-03-31 14:05:00 +02:00
Avi Kivity
216d39883a Merge 'test: audit: fix audit test syslog race' from Dario Mirovic
Fix two independent race conditions in the syslog audit test that cause intermittent `assert 2 <= 1` failures in `assert_entries_were_added`.

**Datagram ordering race:**
`UnixSockerListener` used `ThreadingUnixDatagramServer`, where each datagram spawns a new thread. The notification barrier in `get_lines()` assumes FIFO handling, but the notification thread can win the lock before an audit entry thread, so `clear_audit_logs()` misses entries that arrive moments later. Fix: switch to sequential `UnixDatagramServer`.

**Config reload race:**
The live-update path used `wait_for_config` (REST API poll on shard 0) which can return before `broadcast_to_all_shards()` completes. Fix: wait for `"completed re-reading configuration file"` in the server log after each SIGHUP, which guarantees all shards have the new config.

Fixes SCYLLADB-1277

This is CI improvement for the latest code. No need for backport.

Closes scylladb/scylladb#29282

* github.com:scylladb/scylladb:
  test: cluster: wait for full config reload in audit live-update path
  test: cluster: fix syslog listener datagram ordering race
2026-03-31 13:53:01 +03:00
Tomasz Grabiec
b355bb70c2 dtest/alternator: stop concurrent-requests test when workers hit limit
`test_limit_concurrent_requests` could create far more tables than intended
because worker threads looped indefinitely and only the probe path terminated
the test. In practice, workers often hit `RequestLimitExceeded` first, but the
test kept running and creating tables, increasing memory pressure and causing
flakiness due to bad_alloc errors in logs.

Fix by replacing the old probe-driven termination with worker-driven
termination. Workers now run until any worker sees
`RequestLimitExceeded`.

Fixes SCYLLADB-1181

Closes scylladb/scylladb#29270
2026-03-31 13:35:50 +03:00
Patryk Jędrzejczak
b9f82f6f23 raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.

Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.

We also provide a regression test. It takes 6s in dev mode.

Fixes SCYLLADB-959

Closes scylladb/scylladb#29266
2026-03-31 12:33:56 +02:00
Marcin Maliszkiewicz
2645b95888 transport: account for response size exceeding initial memory estimate
After obtaining the CQL response, check if its actual size exceeds
the initially acquired memory permit. If so, take semaphore units
and adopt them into the permit (non blocking).

This doesn't fully prevent from allocating too much memory as
size is known when buffer is already allocated but improves
memory accounting for big responses.
2026-03-31 11:57:41 +02:00
Ferenc Szili
7b308f3aa0 test: verify hints are delivered during tablet RF reduction
Add test_hint_to_leaving_when_reducing_rf which verifies that mutations
stored as hints are delivered to the correct replicas when a tablet is
removed due to RF reduction. The test sets up a 3-node cluster with
RF=2, drops the hint for one replica via error injection, then reduces
RF to 1 while hints are pending. It asserts that the mutation is
readable after the topology change completes.

Also adds a "drop_hint_for_host" error injection point in
hint_endpoint_manager to selectively drop hints for a specific host.
2026-03-31 09:18:42 +02:00
Dario Mirovic
0cb63fb669 test: cluster: wait for full config reload in audit live-update path
_apply_config_to_running_servers used wait_for_config (REST API poll)
to confirm live config updates. The REST API reads from shard 0 only,
so it can return before broadcast_to_all_shards() completes — other
shards may still have stale audit config, generating unexpected entries.
Additionally, server_remove_config_option for absent keys sent separate
SIGHUPs before server_update_config, and the single wait_for_config at
the end could match a completion from an earlier SIGHUP.

Wait for "completed re-reading configuration file" in the server log
after each SIGHUP-producing operation. This message is logged only
after both read_config() and broadcast_to_all_shards() finish,
guaranteeing all shards have the new config. Each operation gets its
own mark+wait so no stale completion is matched.

Fixes SCYLLADB-1277
2026-03-31 02:27:11 +02:00
Dario Mirovic
1d623196eb test: cluster: fix syslog listener datagram ordering race
UnixSockerListener used ThreadingUnixDatagramServer, which spawns a
new thread per datagram. The notification barrier in get_lines() relies
on all prior datagrams being handled before the notification. With
threading, the notification handler can win the lock before an audit
entry handler, so get_lines() returns before the entry is appended.
clear_audit_logs() then clears an incomplete buffer, and the late
entry leaks into the next test's before/after diff.

Switch to sequential UnixDatagramServer. The server thread now handles
datagrams in kernel FIFO order, so the notification is always processed
after all preceding audit entries.

Refs SCYLLADB-1277
2026-03-31 02:27:11 +02:00
Karol Nowacki
493a4433e7 index: fix DESC INDEX for vector index
The `DESC INDEX` command returned incorrect results for local vector
indexes and for vector indexes that included filtering columns.

This patch corrects the implementation to ensure `DESCRIBE INDEX`
accurately reflects the index configuration.

This was a pre-existing issue, not a regression from recent
serialization schema changes for vector index target options.
2026-03-30 16:46:48 +02:00
Karol Nowacki
a32e4bb9f4 vector_search: test: refactor boilerplate setup
The test boilerplate setup for some vector store client tests
has been extracted to a common function.
2026-03-30 16:46:48 +02:00
Karol Nowacki
6bc88e817f vector_search: fix SELECT on local vector index
Queries against local vector indexes were failing with the error:
"ANN ordering by vector requires the column to be indexed using 'vector_index'"

This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.

Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have  different target format.

This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.

This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.

Fixes: SCYLLADB-895
2026-03-30 16:46:48 +02:00
Karol Nowacki
c0b78477a5 index: test: vector index target option serialization test
This test ensures that the serialization format for vector index target
options remains stable. Maintaining backward compatibility is critical
because the index is restored from this property on startup.
Any unintended changes to the serialization schema could break existing
indexes after an upgrade.

This option is also an interface for the vector-store service,
which uses it to identify the indexed column.
2026-03-30 16:46:48 +02:00
Karol Nowacki
4dc28dfa52 index: test: secondary index target option serialization test
Target option serialization must remain stable for backward compatibility.
The index is restored from this property on startup, so unintentional
changes to the serialization schema can break indexes after upgrade.
2026-03-30 16:46:47 +02:00
Patryk Jędrzejczak
ba54b2272b raft: speed up read barrier requested by non-voters
We achieve this by making the leader replicate to the non-voting requester
of a read barrier if its commit_idx is behind.

There are some corner cases where the new `replicate_to(*opt_progress, true);`
call will be a no-op, while the corresponding call in `tick_leader()` would
result in sending the AppendEntries RPC to the follower. These cases are:
- `progress.state == follower_progress::state::PROBE && progress.probe_sent`,
- `progress.state == follower_progress::state::PIPELINE
  && progress.in_flight == follower_progress::max_in_flight`.
We could try to improve the optimization by including some of the cases above,
but it would only complicate the code without noticeable benefits (at least
for group0).

Note: this is the second attempt for this optimization. The first approach
turned out to be incorrect and was reverted in the previous commit. The
performance improvement is the same as in the previous case.
2026-03-30 15:56:24 +02:00
Patryk Jędrzejczak
4913acd742 Revert "raft: read_barrier: update local commit_idx to read_idx when it's safe"
This reverts commit 1ae2ae50a6.

The reverted change turned out to be incorrect. The Raft member requesting
a read barrier could incorrectly advance its commit_idx and break
linearizability. More details in
https://scylladb.atlassian.net/browse/SCYLLADB-998?focusedCommentId=42935
2026-03-30 15:56:24 +02:00
Ferenc Szili
1d64ddbdd3 hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
hint_sender decides whether to send a hint directly to its destination
or to re-mutate from scratch based on token_metadata::is_leaving(),
which only checks whether the *host* is leaving the cluster. When a
tablet is dropped from a host due to RF reduction (RF--), the host
is still alive and is_leaving() returns false, so hint_sender sends
directly to a replica that will no longer own the data -- effectively
losing the hint.

Switch to the new ermp->is_leaving(host, token) which is tablet-aware.
When the destination's tablet is being migrated away *and* there are
pending endpoints, send directly (the pending endpoints will receive
the data via streaming); otherwise fall through to the re-mutate path
so all current replicas receive the mutation.
2026-03-30 15:49:59 +02:00
Ferenc Szili
7db239b2ed erm: add is_leaving() to effective_replication_map
token_metadata::is_leaving() only knows whether a *host* is leaving the
cluster, which is insufficient for tablets -- a tablet can be migrated
away from a host (e.g. during RF reduction) without the host itself
leaving.

Add a pure virtual is_leaving(host, token) to effective_replication_map
so callers can ask per-token questions. The vnode implementation
delegates to token_metadata::is_leaving() (host-level, as before). The
tablet implementation checks whether the tablet owning the token has a
transition whose leaving replica matches the given host.
2026-03-30 15:49:01 +02:00
Andrzej Jackowski
ab43420d30 test: use exclusive driver connection in test_limited_concurrency_of_writes
Use get_cql_exclusive(node1) so the driver only connects to node1 and
never attempts to contact the stopped node2. The test was flaky because
the driver received `Host has been marked down or removed` from node2.

Fixes: SCYLLADB-1227

Closes scylladb/scylladb#29268
2026-03-30 11:50:44 +02:00
Botond Dénes
068a7894aa test/cluster: fix flaky test_cleanup_stop by using asyncio.sleep
The test was using time.sleep(1) (a blocking call) to wait after
scheduling the stop_compaction task, intending to let it register on
the server before releasing the sstable_cleanup_wait injection point.

However, time.sleep() blocks the asyncio event loop entirely, so the
asyncio.create_task(stop_compaction) task never gets to run during the
sleep. After the sleep, the directly-awaited message_injection() runs
first, releasing the injection point before stop_compaction is even
sent. By the time stop_compaction reaches Scylla, the cleanup has
already completed successfully -- no exception is raised and the test
fails.

Fix by replacing time.sleep(1) with await asyncio.sleep(1), which
yields control to the event loop and allows the stop_compaction task
to actually send its HTTP request before message_injection is called.

Fixes: SCYLLADB-834

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29202
2026-03-30 11:40:47 +03:00
Nikos Dragazis
3b3b02b15a docs: Add ops guide for vnodes-to-tablets migration
The vnodes-to-tablets migration is a manual procedure, so instructions
need to be provided to the users.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-29 22:18:46 +03:00
Ernest Zaslavsky
1d779804a0 scripts: remove lua library rename workaround from comparison script
Now that cmake/FindLua.cmake uses pkg-config (matching configure.py),
both build systems resolve to the same 'lua' library name.  Remove the
lua/lua-5.4 entries from _KNOWN_LIB_ASYMMETRIES and add 'm' (math
library) as a known transitive dependency that configure.py gets via
pkg-config for lua.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
c32851b102 cmake: add custom FindLua using pkg-config to match configure.py
CMake's built-in FindLua resolves to the versioned library file
(e.g. liblua-5.4.so) instead of the unversioned symlink (liblua.so),
causing a library name mismatch between the two build systems.
Add a custom cmake/FindLua.cmake that uses pkg-config — matching
configure.py's approach — and find_library(NAMES lua) to find the
unversioned symlink.  This also mirrors the pattern used by other
Find modules in cmake/ (FindxxHash, Findlz4, etc.).
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
f3a91df0b4 test/cmake: add missing tests to boost test suite
Add symmetric_key_test (standalone, links encryption library) and
auth_cache_test to the combined_tests binary. These tests already
exist in configure.py; this aligns the CMake build.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
de606cc17a test/cmake: remove per-test LTO disable
The per-test -fno-lto link option is now redundant since -fno-lto
was added globally in mode.common.cmake. LTO-enabled targets
(the scylla binary in RelWithDebInfo) override it via enable_lto().
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
38ba58567a cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
Match configure.py's Boost handling:
- Add BOOST_ALL_DYN_LINK when using shared Boost libraries.
- Strip per-component defines (BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK,
  BOOST_REGEX_DYN_LINK, etc.) that CMake's Boost package config
  adds on imported targets. configure.py only uses the umbrella
  BOOST_ALL_DYN_LINK define.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
7e72898150 cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
Place add_compile_definitions(SEASTAR_TESTING_MAIN) after both
add_subdirectory(seastar) and add_subdirectory(abseil) are processed.
This matches configure.py's global define without leaking into
seastar's subdirectory build (which would cause a duplicate main
symbol in seastar_testing).
Remove the now-redundant per-test SEASTAR_TESTING_MAIN compile
definition from test/CMakeLists.txt.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
b0837ead3e cmake: add -fno-sanitize=vptr for abseil sanitizer flags
Match configure.py line 2192: abseil gets sanitizer flags with
-fno-sanitize=vptr to exclude vptr checks which are incompatible
with abseil's usage of type-punning patterns.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
dd829fa69c cmake: align Seastar build configuration with configure.py
- Set BUILD_SHARED_LIBS based on build type to match configure.py's
  build_seastar_shared_libs: Debug and Dev build Seastar as a shared
  library, all other modes build it static.
- Add sanitizer link options on the seastar target for Coverage
  mode. Seastar's CMake only activates sanitizer targets for
  Debug/Sanitize configs, but Coverage mode needs them too since
  configure.py's seastar_libs_coverage carries -fsanitize flags.
2026-03-29 16:17:45 +03:00
Ernest Zaslavsky
52e4d44a75 cmake: align global compile defines and options with configure.py
- Disable CMake's automatic -fcolor-diagnostics injection for
  Clang+Ninja (CMake 3.24+), matching configure.py which does not
  add any color diagnostics flags.
- Add SEASTAR_NO_EXCEPTION_HACK and XXH_PRIVATE_API as global
  defines (previously SEASTAR_NO_EXCEPTION_HACK was only on the
  seastar target as PRIVATE; it needs to be project-wide).
- Add -fpch-validate-input-files-content to check precompiled
  header content when timestamps don't match.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
6f2fe3c2fc cmake: fix Coverage mode in mode.Coverage.cmake
Fix multiple deviations from configure.py's coverage mode:
- Remove -fprofile-list from CMAKE_CXX_FLAGS_COVERAGE. That flag
  belongs in COVERAGE_INST_FLAGS applied to other modes, not to
  coverage mode itself.
- Replace incorrect defines (DEBUG, SANITIZE, DEBUG_LSA_SANITIZER,
  SCYLLA_ENABLE_ERROR_INJECTION) with the correct Seastar debug
  defines (SEASTAR_DEBUG, SEASTAR_DEFAULT_ALLOCATOR, etc.) that
  configure.py's pkg-config query produces for coverage mode.
- Add sanitizer and stack-clash-protection compile flags for
  Coverage config, matching the flags that Seastar's pkg-config
  --cflags output includes for debug builds.
- Change CMAKE_STATIC_LINKER_FLAGS_COVERAGE to
  CMAKE_EXE_LINKER_FLAGS_COVERAGE. Coverage flags need to reach
  the executable linker, not the static archiver.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
7d23ba7dc8 cmake: align mode.common.cmake flags with configure.py
Add three flag-alignment changes:
- -Wno-error=stack-usage= alongside the stack-usage threshold flag,
  preventing hard errors from stack-usage warnings (matching
  configure.py behavior).
- -fno-lto global link option. configure.py adds -fno-lto to all
  binaries; LTO-enabled targets override it via enable_lto().
- Sanitizer link flags (-fsanitize=address, -fsanitize=undefined) for
  Debug/Sanitize configs, matching configure.py's cxx_ld_flags.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
38088a8a94 configure.py: add sstable_tablet_streaming to combined_tests 2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
33bca2428a docs: add compare-build-systems.md
Document the purpose, usage, and examples for
scripts/compare_build_systems.py which compares the configure.py
and CMake build systems by parsing their ninja build files.
2026-03-29 16:17:44 +03:00
Ernest Zaslavsky
d3972369a0 scripts: add compare_build_systems.py to compare ninja build files
Add a script that compares configure.py and CMake build systems by
parsing their generated build.ninja files. The script checks:
  - Per-file compilation flags (defines, warnings, optimization)
  - Link target sets (detect missing/extra targets)
  - Per-target linker flags and libraries

configure.py is treated as the baseline. CMake should match it.
Both systems are always configured into a temporary directory so the
user's build tree is never touched.

Usage:
  scripts/compare_build_systems.py -m dev   # single mode
  scripts/compare_build_systems.py          # all modes
  scripts/compare_build_systems.py --ci     # CI mode (strict)
2026-03-29 16:17:44 +03:00
Nadav Har'El
d32fe72252 Merge 'alternator: check concurrency limit before memory acquisition' from Łukasz Paszkowski
Fix the ordering of the concurrency limit check in the Alternator HTTP server so it happens before memory acquisition, and reduce test pressure to avoid LSA exhaustion on the memory-constrained test node.

The patch moves the concurrency check to right after the content-length early-out, before any memory acquisition or I/O. The check was originally placed before memory acquisition but was inadvertently moved after it during a refactoring. This allowed unlimited requests to pile up consuming memory, reading bodies, verifying signatures, and decompressing — all before being rejected. Restores the original ordering and mirrors the CQL transport (`transport/server.cc`).

Lowers `concurrent_requests_limit` from 5 to 3 and the thread multiplier from 5 to 2 (6 threads instead of 25). This is still sufficient to reliably trigger RequestLimitExceeded, while keeping flush pressure within what 512MB per shard can sustain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1248
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1181

The test started to fail quite recently. It affects master only. No backport is needed. We might want to consider backporting a commit moving the concurrency check earlier.

Closes scylladb/scylladb#29272

* github.com:scylladb/scylladb:
  test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion
  alternator: check concurrency limit before memory acquisition
2026-03-29 11:08:28 +03:00
Łukasz Paszkowski
b8e3ef0c64 test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion
The test_limit_concurrent_requests dtest uses concurrent CreateTable
requests to verify Alternator's concurrency limiting.  Each admitted
CreateTable triggers Raft consensus, schema mutations, and memtable
flushes—all of which consume LSA memory.  On the 1 GB test node
(2 SMP × 512 MB), the original settings (limit=5, 25 threads) created
enough flush pressure to exhaust the LSA emergency reserve, producing
logalloc::bad_alloc errors in the node log.  The test was always
marginal under these settings and became flaky as new system tables
increased baseline LSA usage over time.

Lower concurrent_requests_limit from 5 to 3 and the thread multiplier
from 5 to 2 (6 threads total).  This is still well above the limit and
sufficient to reliably trigger RequestLimitExceeded, while keeping flush
pressure within what 512 MB per shard can sustain.
2026-03-28 20:40:33 +01:00
Łukasz Paszkowski
a86928caa1 alternator: check concurrency limit before memory acquisition
The concurrency limit check in the Alternator server was positioned after
memory acquisition (get_units), request body reading (read_entire_stream),
signature verification, and decompression. This allowed unlimited requests
to pile up consuming memory before being rejected, exhausting LSA memory
and causing logalloc::bad_alloc errors that cascade into Raft applier
and topology coordinator failures, breaking subsequent operations.

Without this fix, test_limit_concurrent_requests on a 1GB node produces
50 logalloc::bad_alloc errors and cascading failures: reads from
system.scylla_local fail, the Raft applier fiber stops, the topology
coordinator stops, and all subsequent CreateTable operations fail with
InternalServerError (500). With this fix, the cascade is eliminated --
admitted requests may still cause LSA pressure on a memory-constrained
node, but the server remains functional.

Move the concurrency check to right after the content-length early-out,
before any memory acquisition or I/O. This mirrors the CQL transport
which correctly checks concurrency before memory acquisition
(transport/server.cc).

The concurrency check was originally added in 1b8c946ad7 (Sep 2020)
*before* memory acquisition, which at the time lived inside with_gate
(after the concurrency gate). The ordering was inverted by f41dac2a3a
(Mar 2021, "avoid large contiguous allocation for request body"), which
moved get_units() earlier in the function to reserve memory before
reading the newly-introduced content stream -- but inadvertently also
moved it before the concurrency check. c3593462a4 (Mar 2025) further
worsened the situation by adding a 16MB fallback reservation for
requests without Content-Length and ungzip/deflate decompression steps
-- all before the concurrency check -- greatly increasing the memory
consumed by requests that would ultimately be rejected.
2026-03-28 20:40:33 +01:00
Aleksandra Martyniuk
166b293d06 test: add test_failed_tablet_rebuild_is_retried_on_alter
Test if alter keyspace statement with the current rf values will
fix the state of replicas.
2026-03-27 17:29:31 +01:00
Aleksandra Martyniuk
9ec54a8207 test: add a test to ensure that failed rebuilds are retried 2026-03-27 17:29:31 +01:00
Aleksandra Martyniuk
200dc084c5 service: fail ALTER KEYSPACE if replicas do not satisfy the replication
RF change of tablet keyspace starts tablet rebuilds. Even if any of
the rebuilds is rolled back (because pending replica was excluded),
rf change request finishes successfully. Yet, we are left with not
enough replicas. Then, a next new rf change request handler would
generate a rebuild of two replicas of the same tablet. Such a transition
would not be applied, as we don't allow many pending replicas.
An exception would be thrown and the request would be retried infinitely,
blocking the topology coordinator.

Throw and fail rf change request if there is not enough replicas.
The request should be retried later, after the issue is fixed
by the mechanism introduced in previous changes.
2026-03-27 17:29:26 +01:00
Aleksandra Martyniuk
7951f92270 service: retry failed tablet rebuilds
RF change of tablet keyspace starts tablet rebuilds. Even if any
of the rebuilds is rolled back (because pending replica was excluded),
rf change request finishes successfully. In this case we end up with
the state of the replicas that isn't compatible with the expected
keyspace replication.

After this change, if topology_coordinator has nothing to do, it
proceeds to check if the state of replicas reflects the keyspace
replication. If there are any mismatches, the tablet rebuilds are
scheduled. All required rebuilds of a single keyspace are scheduled
together without respecting the node's load (just as it happens
in case of keyspace rf change).
2026-03-27 17:26:45 +01:00
Aleksandra Martyniuk
6f1bba8faf service: maybe_start_tablet_migration returns std::optional<group0_guard>
maybe_start_tablet_migration takes an ownership of group0_guard and
does not give it back, even if no work was done.

In the following patches, we will proceed with different operations,
if there are no migrations to be started. Thus, the guard would be needed.

Return group0_guard from  maybe_start_tablet_migration is no work
was done.
2026-03-27 17:26:45 +01:00
Emil Maskovsky
9dad68e58d raft: abort stale snapshot transfers when term changes
**The Bug**

Assertion failure: `SCYLLA_ASSERT(res.second)` in `raft/server.cc`
when creating a snapshot transfer for a destination that already had a
stale in-flight transfer.

**Root Cause**

If a node loses leadership and later becomes leader again before the next
`io_fiber` iteration, the old transfer from the previous term can remain
in `_snapshot_transfers` while `become_leader()` resets progress state.
When the new term emits `install_snapshot(dst)`, `send_snapshot(dst)`
tries to create a new entry for the same destination and can hit the
assertion.

**The Fix**

Abort all in-flight snapshot transfers in `process_fsm_output()` when
`term_and_vote` is persisted. A term/vote change marks existing transfers
as stale, so we clean them up before dispatching messages from that batch
and before any new snapshot transfer is started.

With cross-term cleanup moved to the term-change path, `send_snapshot()`
now asserts the within-term invariant that there is at most one in-flight
transfer per destination.

Fixes: SCYLLADB-862

Backport: The issue is reproducible in master, but is present in all
active branches.

Closes scylladb/scylladb#29092
2026-03-27 10:00:15 +01:00
Andrzej Jackowski
181ad9f476 Revert "audit: disable DDL by default"
This reverts commit c30607d80b.

With the default configuration, enabling DDL has no effect because
no `audit_keyspaces` or `audit_tables` are specified. Including DDL
in the default categories can be misleading for some customers, and
ideally we would like to avoid it.

However, DDL has been one of the default audit categories for years,
and removing it risks silently breaking existing deployments that
depend on it. Therefore, the recent change to disable DDL by default
is reverted.

Fixes: SCYLLADB-1155

Closes scylladb/scylladb#29169
2026-03-27 09:55:11 +01:00
Botond Dénes
854c374ebf test/encryption: wait for topology convergence after abrupt restart
test_reboot uses a custom restart function that SIGKILLs and restarts
nodes sequentially. After all nodes are back up, the test proceeded
directly to reads after wait_for_cql_and_get_hosts(), which only
confirms CQL reachability.

While a node is restarted, other nodes might execute global token
metadata barriers, which advance the topology fence version. The
restarted node has to learn about the new version before it can send
reads/writes to the other nodes. The test issues reads as soon as the
CQL port is opened, which might happen before the last restarted node
learns of the latest topology version. If this node acts as a
coordinator for reads/write before this happens, these will fail as the
other nodes will reject the ops with the outdated topology fence
version.

Fix this by replacing wait_for_cql_and_get_hosts() on the abrupt-restart
path with the more robus get_ready_cql(), which makes sure servers see
each other before refreshing the cql connection. This should ensure that
nodes have exchanged gossip and converged on topology state before any
reads are executed. The rolling_restart() path is unaffected as it
handles this internally.

Fixes: SCYLLADB-557

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29211
2026-03-27 09:52:27 +01:00
Avi Kivity
b708e5d7c9 Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin
`test_crashed_node_substitution` intermittently failed:
```python
   assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.

This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.

Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).

backport: this issue may affect CI for all branches, so should be backported to all versions.

[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29254

* github.com:scylladb/scylladb:
  test: test_crashed_node_substitution: add docstring and fix whitespace
  test: fix race condition in test_crashed_node_substitution
2026-03-26 21:40:33 +02:00
Petr Gusev
c38e312321 test_lwt_fencing_upgrade: fix quorum failure due to gossip lag
If lwt_workload() sends an update immediately after a
rolling restart, the coordinator might still see a replica as
down due to gossip lagging behind. Concurrently restarting another
node leaves only one available replica, failing the
LOCAL_QUORUM requirement for learn or eventually consistent
sp::query() in sp::cas() and resulting in
a mutation_write_failure_exception.

We fix this problem by waiting for the restarted server
to see 2 other peers. The server_change_version
doesn't do that by default -- it passes
wait_others=0 to server_start().

Fixes SCYLLADB-1136

Closes scylladb/scylladb#29234
2026-03-26 21:25:53 +02:00
bitpathfinder
627a8294ed test: test_crashed_node_substitution: add docstring and fix whitespace
Add a description of the test's intent and scenario; remove extra blanks.
2026-03-26 18:40:17 +01:00
bitpathfinder
5a086ae9b7 test: fix race condition in test_crashed_node_substitution
`test_crashed_node_substitution` intermittently failed:
```
    assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake
("finished do_send_ack2_msg"), assuming the node state was
visible to all peers. However, since gossip is eventually
consistent, the update may not have propagated yet, so some
nodes did not see the failed node.

This change: Wait until the gossiper state is visible on
peers before continuing the test and asserting.

Fixes: SCYLLADB-1256.
2026-03-26 18:25:05 +01:00
Robert Bindar
c575bbf1e8 test_refresh_deletes_uploaded_sstables should wait for sstables to get deleted
SSTable unlinking is async, so in some cases it may happen that
the upload dir is not empty immediately after refresh is done.
This patch adjusts test_refresh_deletes_uploaded_sstables so
it waits with a timeout till the upload dir becomes empty
instead of just assuming the API will sync on sstables being
gone.

Fixes SCYLLADB-1190

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#29215
2026-03-26 08:43:14 +03:00
Nikos Dragazis
8789c95a85 test: cluster: Add test for migration of multiple keyspaces
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
25af8bdc24 test: cluster: Add test for error conditions
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
01a51817c4 test: cluster: Add vnodes->tablets migration test (rollback)
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
56ec33d3e0 test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
58e930c490 test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
This test runs the vnodes-to-tablets migration for a single table on a
single-node cluster. The node has multiple shards and multiple
power-of-two aligned vnodes, so resharding is triggered.

More details in the docstring.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
8837dac2f9 scylla-nodetool: Add migrate-to-tablets subcommand
The vnodes-to-tablets migration is a manual procedure, so orchestration
must be done via nodetool.

This patch adds the following new commands:

* nodetool migrate-to-tablets start {ks}
* nodetool migrate-to-tablets upgrade
* nodetool migrate-to-tablets downgrade
* nodetool migrate-to-tablets status {ks}
* nodetool migrate-to-tablets finalize {ks}

The commands are just wrappers over the REST API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:29 +02:00
Nikos Dragazis
2a5e6b832a api: Add REST endpoint for vnode-to-tablet migration status
If the keyspace is migrating, it reports the intended and actual storage
mode for each node.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:24 +02:00
Marcin Maliszkiewicz
7fdd650009 Merge 'test: audit: clean up test helper class naming' from Dario Mirovic
Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`.

Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`.

Logs before the fixes:
```
test/cluster/test_audit.py:514: 14 warnings
  /home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py)
    @pytest.mark.single_node
```

Fixes SCYLLADB-1237

This is an addition to the latest master code. No backport needed.

Closes scylladb/scylladb#29237

* github.com:scylladb/scylladb:
  test: audit: rename TestCQLAudit to CQLAuditTester
  test: audit: remove unused pytest.mark.single_node
2026-03-25 15:30:16 +01:00
Radosław Cybulski
1dc20cc8f9 alternator/test: explain why 'always' write isolation mode is used in tests
Improve test comments for test_streams_batchwrite_into_the_same_partition_deletes_existing_items
and test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data to explain why
'always' write isolation mode is required: in always_use_lwt mode all items in a batch get the
same CDC timestamp, which triggers the squashing bug. In other modes each item gets a separate
timestamp so the bug doesn't manifest.

Also fix the example in the second test comment to use cleaner key values and correct event type
(INSERT, not MODIFY, since items are inserted into an empty table), and fix the issue reference
from #28452 (the PR) to #28439 (the issue).
2026-03-25 15:15:20 +01:00
Dario Mirovic
552a2d0995 test: audit: rename TestCQLAudit to CQLAuditTester
pytest tries to collect tests for execution in several ways.
One is to pick all classes that start with 'Test'. Those classes
must not have custom '__init__' constructor. TestCQLAudit does.

TestCQLAudit after migration from test/cluster/dtest is not a test
class anymore, but rather a helper class. There are two ways to fix
this:
1. Add __init__ = False to the TestCQLAudit class
2. Rename it to not start with 'Test'

Option 2 feels better because the new name itself does not convey
the wrong message about its role.

Fixes SCYLLADB-1237
2026-03-25 13:21:08 +01:00
Dario Mirovic
73de865ca3 test: audit: remove unused pytest.mark.single_node
Remove unused pytest.mark.single_node in TestCQLAudit class.
This is a leftover from audit tests migration from
test/cluster/dtest to test/cluster.

Refs SCYLLADB-1237
2026-03-25 13:18:37 +01:00
Radosław Cybulski
ded62b2c5e alternator/test: add scylla_only to always write isolation fixture
Add scylla_only fixture dependency to the
test_table_ss_new_and_old_images_write_isolation_always fixture.
This ensures all tests using the 'always' write isolation mode
are skipped when running against DynamoDB (--aws), since the
system:write_isolation tag is a Scylla-only feature.
2026-03-25 12:38:09 +01:00
Radosław Cybulski
7d404cdd51 alternator: fix BatchWriteItem squashed Streams entries
BatchWriteItem with items for the same partition (and write isolation
set to always) will trigger LWT and run different cdc code path, which
will result in wrong Streams data being returned to the user -
changes will be randomly squashed together.
For example batch write:

  batch.put_item(Item={'p': 'p', 'c': 'c0'})
  batch.put_item(Item={'p': 'p', 'c': 'c1'})
  batch.put_item(Item={'p': 'p', 'c': 'c2'})

instead of producing 3 modify / insert events will produce one:

  type=INSERT, key={'c': {'S': 'c0'}, 'p': {'S': 'p'}},
      old_image=None, new_image={'c': {'S': 'c2'}, 'p': {'S': 'p'}}

with `new_image` having different `c` key from `key` field.

This happens because BatchWriteItem (when using LWT) emits it's changes
to cdc under the same timestamp. This results in in all log entries
being put in single cdc "bucket" (under the same cdc$timestamp key).
Previous parsing algorithm would interpret those changes as a change
to a single item and squash them together.

The patch rewrites algorithm to use `std::unordered_map` for records
based on value of clustering key, that is added to every cdc log entry.
This allows rebuilding all item modifications.

Fixes #28439
Fixes: SCYLLADB-540
2026-03-25 11:40:53 +01:00
Radosław Cybulski
85da03c88d alternator: add BatchWriteItem test (failing)
Add additional BatchWriteItem tests (some failing):
- `test_streams_batchwrite_no_clustering_deletes_non_existing_items`
  `test_streams_batchwrite_no_clustering_deletes_existing_items` -
  those tests pass, we add it here for completness, as non clustering
  tables trigger different paths.
- `test_streams_batchwrite_into_the_same_partition_deletes_existing_items` -
  failing test, that checks combinations of puts and deletes in a single
  batch write (so for example 3 items, 2 puts and 1 delete).
- `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data` -
  failing simple test.

Tests fail, because current implementation, when writing cdc log
entries will squash all changes done to the same partition together.
The data is still there, but when GetRecords is called and we parse
cdc log entries, we don't correctly recover it (see issue #28439 for
more details).
2026-03-25 11:40:53 +01:00
Marcin Maliszkiewicz
f988ec18cb test/lib: fix port in-use detection in start_docker_service
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216

Closes scylladb/scylladb#29219
2026-03-25 11:45:53 +02:00
Artsiom Mishuta
cd1679934c test/pylib: use exponential backoff in wait_for()
Change wait_for() defaults from period=1s/no backoff to period=0.1s
with 1.5x backoff capped at 1.0s. This catches fast conditions in
100ms instead of 1000ms, benefiting ~100 call sites automatically.

Add completion logging with elapsed time and iteration count.

Tested local with test/cluster/test_fencing.py::test_fence_hints (dev mode),
log output:

  wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations)
  wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations)

Fixes SCYLLADB-738

Closes scylladb/scylladb#29173
2026-03-24 23:49:49 +02:00
Botond Dénes
d52fbf7ada Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137

Backport: The test is present on all supported branches, and so we
          should backport these changes to them.

Closes scylladb/scylladb#29218

* github.com:scylladb/scylladb:
  test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
  test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
2026-03-24 21:09:19 +02:00
Patryk Jędrzejczak
141aa2d696 Merge 'test/cluster/test_incremental_repair.py: fix typo + enable compaction DEBUG logs' from Botond Dénes
This PR contains two small improvements to `test_incremental_repair.py`
motivated by the sporadic failure of
`test_tablet_incremental_repair_and_scrubsstables_abort`.

The test fails with `assert 3 == 2` on `len(sst_add)` in the second
repair round. The extra SSTable has `repaired_at=0`, meaning scrub
unexpectedly produced more unrepaired SSTables than anticipated. Since
scrub (and compaction in general) logs at DEBUG level and the test did
not enable debug logging, the existing logs do not contain enough
information to determine the root cause.

**Commit 1** fixes a long-standing typo in the helper function name
(`preapre` -> `prepare`).

**Commit 2** enables `compaction=debug` for the Scylla nodes started by
`do_tablet_incremental_repair_and_ops`, which covers all
`test_tablet_incremental_repair_and_*` variants. This will capture full
compaction/scrub activity on the next reproduction, making the failure
diagnosable.

Refs: SCYLLADB-1086

Backport: test improvement, no backport

Closes scylladb/scylladb#29175

* https://github.com/scylladb/scylladb:
  test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
  test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
2026-03-24 16:27:01 +01:00
Pavel Emelyanov
2d8540f1ee transport: fix process_startup cert-auth path missing connection-ready setup
When authenticate() returns a user directly (certificate-based auth,
introduced in 20e9619bb1), process_startup was missing the same
post-authentication bookkeeping that the no-auth and SASL paths perform:

  - update_scheduling_group(): without it, the connection runs under the
    default scheduling group instead of the one mapped to the user's
    service level.

  - _authenticating = false / _ready = true: without them,
    system.clients reports connection_stage = AUTHENTICATING forever
    instead of READY.

  - on_connection_ready(): without it, the connection never releases its
    slot in the uninitialized-connections concurrency semaphore (acquired
    at connection creation), leaking one unit per cert-authenticated
    connection for the lifetime of the connection.

The omission was introduced when on_connection_ready() was added to the
else and SASL branches in 474e84199c but the cert-auth branch was missed.

Fixes: 20e9619bb1 ("auth: support certificate-based authentication")

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 18:02:46 +03:00
Pavel Emelyanov
da6fe14035 transport: test that connection_stage is READY after auth via all process_startup paths
The cert-auth path in process_startup (introduced in 20e9619bb1) was
missing _ready = true, _authenticating = false, update_scheduling_group()
and on_connection_ready(). The result is that connections authenticated
via certificate show connection_stage = AUTHENTICATING in system.clients
forever, run under the wrong service-level scheduling group, and hold
the uninitialized-connections semaphore slot for the lifetime of the
connection.

Add a parametrized cluster test that verifies all three process_startup
branches result in connection_stage = READY:
  - allow_all: AllowAllAuthenticator (no-auth path)
  - password:  PasswordAuthenticator (SASL/process_auth_response path)
  - cert_bypass: CertificateAuthenticator with transport_early_auth_bypass
                 error injection (cert-auth path -- the buggy one)

The injection is added to certificate_authenticator::authenticate() so
tests can bypass actual TLS certificate parsing while still exercising
the cert-auth code path in process_startup.

The cert_bypass case is marked xfail until the bug is fixed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 18:01:28 +03:00
Benny Halevy
1a7b013377 test: add test_sstable_clone_preserves_staging_state 2026-03-24 16:48:01 +02:00
Benny Halevy
22f2010477 test: derive sstable state from directory in test_env::make_sstable
Instead of always passing sstable_state::normal, infer the state from
the last component of the directory path by comparing against the known
state subdirectory constants (staging_dir, upload_dir, quarantine_dir).
Any unrecognized path component (the common case for normal-state
sstables) maps to sstable_state::normal.

When a non-normal state is detected, strip the state subdirectory from
dir so that the base table directory is passed to storage.
2026-03-24 16:48:01 +02:00
Ernest Zaslavsky
c670183be8 cmake: fix precompiled header (PCH) creation
Two issues prevented the precompiled header from compiling
successfully when using CMake directly (rather than the
configure.py + ninja build system):

a) Propagate build flags to Rust binding targets reusing the
   PCH. The wasmtime_bindings and inc targets reuse the PCH
   from scylla-precompiled-header, which is compiled with
   Seastar's flags (including sanitizer flags in
   Debug/Sanitize modes). Without matching compile options,
   the compiler rejects the PCH due to flag mismatch (e.g.,
   -fsanitize=address). Link these targets against
   Seastar::seastar to inherit the required compile options.

Closes scylladb/scylladb#28941
2026-03-24 15:53:40 +02:00
Dawid Mędrek
e639dcda0b test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137
2026-03-24 14:27:36 +01:00
Patryk Jędrzejczak
503a6e2d7e locator: everywhere_replication_strategy: fix sanity_check_read_replicas when read_new is true
ERMs created in `calculate_vnode_effective_replication_map` have RF computed based
on the old token metadata during a topology change. The reading replicas, however,
are computed based on the new token metadata (`target_token_metadata`) when
`read_new` is true. That can create a mismatch for EverywhereStrategy during some
topology changes - RF can be equal to the number of reading replicas +-1. During
bootstrap, this can cause the
`everywhere_replication_strategy::sanity_check_read_replicas` check to fail in
debug mode.

We fix the check in this commit by allowing one more reading replica when
`read_new` is true.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147

Closes scylladb/scylladb#29150
2026-03-24 13:43:39 +01:00
Jenkins Promoter
0f02c0d6fa Update pgo profiles - x86_64 2026-03-24 14:11:38 +02:00
Dawid Mędrek
4fead4baae test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
One of the tests,
test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces,
didn't have the marker. Let's add it now.
2026-03-24 12:52:00 +01:00
Botond Dénes
ffd58ca1f0 Merge 'test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints' from Dawid Mędrek
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult do debug it
without this additional information.

Fixes SCYLLADB-1133

Backport: The test is present on all supported branches. To avoid
          running into more failures, we should backport these changes
          to them.

Closes scylladb/scylladb#29191

* github.com:scylladb/scylladb:
  test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Introduce auxiliary function keyspace_has_tablets
  test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
2026-03-24 13:39:56 +02:00
Calle Wilund
f1b3bff4a5 dockerized_service: Convert log reader to pipes and push to test log
Refs: SCYLLADB-1106

Ensures any stderr logs from mock services will echo to the test log
regardless of the log file we write. To help debug failed CI.
2026-03-24 12:35:42 +01:00
Calle Wilund
38aaed1ed4 test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
Use checked dict access to check the set vars.

Fixes: SCYLLADB-1106
2026-03-24 12:33:56 +01:00
Calle Wilund
b382f3593c scylla_cluster: Use thread safe future signalling 2026-03-24 12:33:56 +01:00
Nikos Dragazis
d09196068c api: Add REST endpoint for migration finalization
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization

When called, it issues a `finalize_migration` topology request and waits
for its completion.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:21:12 +02:00
Nikos Dragazis
c88ddecfca topology_coordinator: Add finalize_migration request
Vnodes-to-tablets migration needs a finalization step to finish or
rollback the migration. Finishing the migration involves switching the
keyspace schema to tablets and clearing the `intended_storage_mode` from
system.topology. Rolling back the migration involves deleting the tablet
maps and clearing the `intended_storage_mode`.

The finalization needs to be done as a topology request to exclude with
other operations such as repair and TRUNCATE.

This patch introduces the `finalize_migration` global topology request
for this purpose. The request takes a keyspace name as an argument.
The direction of the finalization (i.e., forward path vs rollback) is
inferred from the `intended_storage_mode` of all nodes (not ideal,
should be made explicit).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:39 +02:00
Nikos Dragazis
0e1e6ebdc5 database: Construct migrating tables with tablet ERMs
Extend `database::add_column_family()` with a `storage_mode` argument.
If the table is under vnodes-to-tablets migration and the storage mode
is "tablets", create a tablet ERM.

Make the distributed loader determine the storage mode from topology
(`intended_storage_mode` column in system.topology).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:39 +02:00
Nikos Dragazis
2f93ab281b api: Add REST endpoint for upgrading nodes to tablets
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}

This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).

Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:35 +02:00
Nikos Dragazis
c4c3a95863 api: Add REST endpoint for starting vnodes-to-tablets migration
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}

Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.

When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.

The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:19:47 +02:00
Andrei Chekun
f6fd3bbea0 test.py: reduce timeout for one test
Reduce the timeout for one test to 60 minutes. The longest test we had
so far was ~10-15 minutes. So reducing this timeout is pretty safe and
should help with hanging tests.

Closes scylladb/scylladb#29212
2026-03-24 12:50:10 +02:00
Benny Halevy
ca9ff134b8 sstables: log debug message in filesystem_storage::clone 2026-03-24 12:26:03 +02:00
Nikos Dragazis
b7f4ae8218 topology_state_machine: Add intended_storage_mode to system.topology
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.

We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.

To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
bc8109f1a4 distributed_loader: Wire vnode-based resharding into table populator
Make the table populator migration-aware. If a table is migrating to
tablets, switch from normal resharding to vnode-based resharding.

Vnode-based resharding requires passing a vector of "owned ranges" upon
which resharding will segregate the SSTables. Compute it from the tablet
map. We could also compute them from the vnodes, since tablets are
identical to vnodes during the migration, but in the future we may
switch to a different model (multiple tablets per vnode).

Let the distributed loader decide if a table is migrating or not and
communicate that to the table populator. A table is migrating if the
keyspace replication strategy uses vnodes but the table replication
strategy uses tablets.

Currently, tables cannot enter this "migrating" state; support for this
will be introduced in the next patches.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
63399951df replica: Pick any compaction group for resharding
In the previous patch, reshard compaction was extended with a special
operation mode where SSTables from vnode-based tables are segregated on
vnode boundaries and not with the static sharder. This will later be
wired into vnodes-to-tablets migration.

The problem is that resharding requires a compaction group. With a
vnode-based table, there is only one compaction group per shard, and
this is what the current code utilizes
(`try_get_compaction_group_view_with_static_sharding()`). But the new
operation mode will apply to migrating tables, which use a
`tablet_storage_group_manager`, which creates one compaction group for
each tablet. Some compaction group needs to be selected.

Pick any compaction group that is available on the current shard.
Reshard compaction is an operation that happens early in the startup
process; compaction groups do not own any SSTables yet, so all
compaction groups are equivalent.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Benny Halevy
d1c6141407 compaction: resharding_compaction: add vnodes_resharding option
In this mode, the output sstables generated by resharding
compaction are segregated by token range, based on the keyspace
vnode-based owned token ranges vector.

A basic unit test was also added to sstable_directory_test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
d153a95943 storage_service: Preserve ERM flavor of migrating tables
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.

Preserve the flavor across token metadata changes. The flavor needs to
be on par with storage, but the storage can change only on startup, as
it requires resharding all SSTables to conform with the flavor.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
4a3e26d5e3 tablet_allocator: Exclude migrating tables from load balancing
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.

With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
3e2dc078c9 feature_service: Add vnodes_to_tablets_migrations feature
Vnodes-to-tablets migrations require cluster-level support: the REST API
and the group0 state need to be supported by all nodes.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Marcin Maliszkiewicz
66be0f4577 Merge 'test: cluster: audit test suite optimization' from Dario Mirovic
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.

The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.

This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them
    - Execution time reduced from 4m 29s to 2m 47s, which is ~38% execution time decrease
3. Removes the old audit tests from test/cluster/dtest

Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters

Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)

This PR is an improvement and does not require a backport.

[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28650

* github.com:scylladb/scylladb:
  test: cluster: fix log clear race condition in test_audit.py
  test: pylib: shut down exclusive cql connections in ManagerClient
  test: cluster: fix multinode audit entry comparison in test_audit.py
  test: cluster: dtest: remove old audit tests
  test: cluster: group migrated audit tests for cluster reuse
  test: cluster: enable migrated audit tests and make them work
  test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
  test: pylib: scylla cluster after_test log fix
  test: audit: copy audit test from dtest
2026-03-24 09:29:52 +01:00
Dario Mirovic
120f381a9d pgo: fix maintenance socket path too long
Maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit
and failing the PGO training pipeline.

To prevent this:
- pass an explicit --maintenance-socket override
  pointing to a short determinitic path in /tmp derived from the MD5
  hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
  so that exec_cql.py connects to the right socket

The short path socket files are cleaned up after the cluster stops.

The path is using MD5 hash of the workdir path, so it is deterministic.

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29149
2026-03-24 09:17:10 +01:00
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Botond Dénes
56c375b1f3 Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov
There's a flaw in table::query() -- calling querier_opt->close() can dereferences a disengaged std::optional. The fix pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged.

The problem doesn't really shows itself becase table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on safe-side.

The problem doesn't show itself for real, not worth backporting

Closes scylladb/scylladb#29142

* github.com:scylladb/scylladb:
  table: merge adjacent querier_opt checks in query()
  table: don't close a disengaged querier in query()
2026-03-24 08:47:35 +02:00
Yaniv Kaul
e59a21752d .github/workflows/trigger_jenkins.yaml: add workflow permissions
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.

To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.

The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.

Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:

```yaml
permissions:
  contents: read
```

between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27815
2026-03-24 08:40:30 +02:00
Yaniv Kaul
85a531819b .github/workflows/trigger-scylla-ci.yaml: add permissions to workflow
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.

In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.

The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.

Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:

```yaml
permissions:
  contents: read
```

after line 1. No other lines in the file need to change.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27812
2026-03-24 08:37:49 +02:00
Dawid Mędrek
148217bed6 test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
We increase the log level of `hints_manager` to TRACE in the test.
If it fails, it may be incredibly difficult to debug it without any
additional information.
2026-03-23 19:19:17 +01:00
Dawid Mędrek
2b472fe7fd test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints 2026-03-23 19:19:17 +01:00
Dawid Mędrek
ae12c712ce test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
The test relies on the assumption that mutations will be distributed
more or less uniformly over the nodes. Although in practice this should
not be possible, theoretically it's possible that there's only one
tablet allocated for the table.

To clearly indicate this precondition, we explicitly set the property
`min_tablet_count` when creating the table. This way, we have a gurantee
that the table has multiple tablets. The load balancer should now take
care of distributing them over the nodes equally. Thanks to that,
`servers[1]` will have some tablets, and so it'll be the target for some
of the mutations we perform.
2026-03-23 19:19:14 +01:00
Dawid Mędrek
dd446aa442 test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
The context manager is the de-facto standard in the test suite. It will
also allow us for a prettier way to conditionally enable per-table
tablet options in the following commit.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
dea79b09a9 test: cluster: Introduce auxiliary function keyspace_has_tablets
The function is adapted from its counterpart in the cqlpy test suite:
cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this
series to conditionally set tablet properties when creating a table.
It might also be useful in general.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
3d04fd1d13 test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

Fixes SCYLLADB-1133
2026-03-23 19:06:57 +01:00
Piotr Dulikowski
63067f594d strong_consistency: fake taking and dropping snapshots
Snapshots are not implemented yet for strong consistency - attempting to
take, transfer or drop a snapshot results in an exception. However, the
logic of our state machine forces snapshot transfer even if there are no
lagging replicas - every raft::server::configuration::snapshot_threshold
log entries. We have actually encountered an issue in our benchmarks
where snapshots were being taken even though the cluster was not under
any disruption, and this is one of the possible causes.

It turns out that we can safely allow for taking snapshots right now -
we can just implement it as a no-op and return a random UUID.
Conversely, dropping a snapshot can also be a no-op. This is safe
because snapshot transfer still throws an exception - as long as the
taken/recovered snapshots are never attempted to be transferred.
2026-03-23 17:03:36 +01:00
Piotr Dulikowski
dd1d3dd1ee strong_consistency: adjust limits for snapshots
Raft snapshots are not implemented yet for strong consistency. Adjust
the current raft group config to make them much less likely to occur:

- snapshot_threshold config option decides how many log entries need to
  be applied after the last snapshot. Set it to the maximum value for
  size_t in order to effectively disable it.
- snapshot_threshold_log_size defines a threshold for the log memory
  usage over which a snapshot is created. Increase it from the default
  2MB to 10MB.
- max_log_size defines the threshold for the log memory usage over which
  requests are stopped to be admitted until the log is shrunk back by a
  snapshot. Set it to 20MB, as this option is recommended to be at least
  twice as much as snapshot_threshold_log_size.

Refs: SCYLLADB-1115
2026-03-23 17:03:36 +01:00
Botond Dénes
772b32d9f7 test/scylla_gdb: fix flakiness by preparing objects at test time
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).

Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is aquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.

Fixes: SCYLLADB-1020

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28999
2026-03-23 16:54:03 +02:00
Piotr Dulikowski
60fb5270a9 logstor: fix fmt::format use with std::filesystem::path
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.

Closes scylladb/scylladb#29148
2026-03-23 15:15:52 +01:00
Pavel Emelyanov
3b9398dfc8 Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128

Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side

Closes scylladb/scylladb#29110

* github.com:scylladb/scylladb:
  encryption: fix deadlock in encrypted_data_source::get()
  test_lib: mark `limiting_data_source_impl` as not `final`
  Fix formatting after previous patch
  Fix indentation after previous patch
  test_lib: make limiting_data_source_impl available to tests
2026-03-23 17:12:44 +03:00
Pavel Emelyanov
57ef712243 test/backup: drop create_dataset helper
It has no more callers after the previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 17:01:20 +03:00
Pavel Emelyanov
2353091cbd test/backup: use new_test_keyspace in test_restore_primary_replica
Replace create_dataset + manual DROP/CREATE KEYSPACE with two sequential
new_test_keyspace context manager blocks, matching the pattern used by
do_test_streaming_scopes. The first block covers backup, the second
covers restore. Keyspace lifecycle is now automatic.

The streaming directions validation loop is moved outside of the second
context block, since it only parses logs and has no dependency on the
keyspace being alive.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 16:59:47 +03:00
Botond Dénes
f5438e0587 test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
The test sporadically fails because scrub produces an unexpected number
of SSTables. Compaction logs are needed to diagnose why, but were not
captured since scrub runs at DEBUG level. Enable compaction=debug for
the servers started by do_tablet_incremental_repair_and_ops so the next
reproduction provides enough information to root-cause the issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:26 +02:00
Botond Dénes
f6ab576ed9 test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:12 +02:00
Pavel Emelyanov
cb329b10bf code: Add maintenance/maintenance group
And move some activities from streaming group into it, namely

- tablet_allocator background group
- sstables_manager-s components reclaimer
- tablet storage group manager merge completion fiber
- prometheus

All other activity that was in streaming group remains there, but can be
moved to this group (or to new maintenance subgroup) later.

All but prometheus are patched here, prometheus still uses the
maintenance_sched_group variable in main.cc, so it transparently
moves into new group

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:03 +03:00
Pavel Emelyanov
de9bfe0f1d backup: Add maintenance/backup group
The snapshot_ctl::backup_task_impl runs in configured scheduling group.
Now it's streaming one. This patch introduces the maintenance/backup
group and re-configures backup task with it.

The group gets its --backup_io_throughput_mb_per_sec option that
controls bandwidth limit for this sub-group only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Pavel Emelyanov
6f43e8562e compaction: Add maintenance/maintenance_compaction group
Compaction manager tells compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, is set to be
"streaming" group. This patch adds real maintenance_compaction group
under the maintenance supergroup and makes compaction manager use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Pavel Emelyanov
13355d1845 main: Introduce maintenance supergroup
And just move streaming group inside it. Next patches will populate this
supergroup further.

The new supergroup gets its --maintenance-io-throughput-mb-per-sec
option that controls supergroup-wide IO bandwidth applied to it. If not
configured, the supergroup gets the throughput from streaming to be
backward compatible.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Pavel Emelyanov
7cb9fa0778 main: Move all maintenance sched group into streaming one
The main.cc code uses two variables to reference streaming scheduling.
This patch stops using the maintenance_sched_group one, because it's in
fact streaming group, and real "maintenance" will appear later in this
set.

One place is deliberately not patched -- prometheus code starts before
dbcfg.streaming_scheduling_group appears, so it still sits uses the
maintenance_sched_group variable. This fact will be used in one of the
next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Pavel Emelyanov
45ecf15fff database: Use local variable for current_scheduling_group
The classify_request() helper captures current scheduling group into
local variable and compares it with groups from db_config to decide
which "class" it belongs to.

One if uses current_scheduling_group(), while it could use the local
variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Pavel Emelyanov
15c41bfb6c code: Live-update IO throughputs from main
Currently we have two live-updateable IO-throughput options -- one for
streaming and one for compaction. Both are observed and the changed
value is applied to the corresponding scheduling_group by the relevant
serice -- respectively, stream_manager and compaction_manager.

Both observe/react/apply places use pretty heavy boilerplate code for
such simple task. Next patches will make things worse by adding two more
options to control IO throughput of some other groups.

Said that, the proposal is to hold the updating code in main.cc with the
help of a wrapper class. In there all the needed bits are at hand, and
classes can get their IO updates applied easily.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul
051107f5bc scylla-gdb: fix sstable-summary crash on ms-format sstables
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.

Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
  to return early when the data length is zero, consistent with the
  existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
  scylla_sstable_summary.invoke() and printing an informative message
  instead of attempting to read the unpopulated summary.

Fixes: SCYLLADB-1180

Closes scylladb/scylladb#29162
2026-03-23 12:44:47 +02:00
Calle Wilund
b36dc80835 scylla_cluster: Remove left-over debug printout 2026-03-23 11:07:59 +01:00
Piotr Szymaniak
c8e7e20c5c test/cluster: retry create_table on transient schema agreement timeout
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.

Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.

Fixes: SCYLLADB-1135

Closes scylladb/scylladb#29132
2026-03-23 10:45:30 +02:00
Yaniv Kaul
fb1f995d6b .github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139)
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,

To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.

The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:

```yaml
    permissions:
      contents: read
      issues: write
```

This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27810
2026-03-23 10:30:01 +02:00
Piotr Smaron
32225797cd dtest: fix flaky test_writes_schema_recreated_while_node_down
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.

Fixes: SCYLLADB-1139

No need to backport yet, appeared only on master.

Closes scylladb/scylladb#29151
2026-03-23 10:25:54 +02:00
Michał Chojnowski
f29525f3a6 test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152
2026-03-23 09:57:11 +02:00
Raphael S. Carvalho
05b11a3b82 sstables_loader: use new sstable add path
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.

This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.

Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.

This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.

Fixes  https://scylladb.atlassian.net/browse/SCYLLADB-1085.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29072
2026-03-23 10:33:04 +03:00
Piotr Szymaniak
f511264831 alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.

Fixes SCYLLADB-1167

Closes scylladb/scylladb#29156
2026-03-22 11:01:45 +02:00
Pavel Emelyanov
c114d1b82c api: Inline describe_ring JSON handling
There are two helpers for describe_ring endpoint. Both can be squashed
together for code brevity.

Also, while at it, the "keyspace" parameter is not properly validated by
the endpoint.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:51:32 +03:00
Pavel Emelyanov
9a2e583f29 storage_service: Make describe_ring_for_table() take table_id
All callers already have it. It makes no difference for the method
itself with which table identifier to work, but will help to simplify
the flow in API handler (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:49:24 +03:00
Pavel Emelyanov
4bc8ec174c repair: Remove db/config.hh from repair/*.cc files
Now all the code uses repair_service::config and no longer needs global
config description.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-20 19:36:50 +03:00
Pavel Emelyanov
35f625e5c7 repair: Move repair_multishard_reader options onto repair_service::config
This actually uses two interconnected options:
repair_multishard_reader_buffer_hint_size and
repair_multishard_reader_enable_read_ahead.

Both are propagated through repair_service::config and pass their
values to repair_reader/make_reader at construction time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:36:50 +03:00
Pavel Emelyanov
9bc0d27aae repair: Move critical_disk_utilization_level onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
80aa0fcdc2 repair: Move repair_partition_count_estimation_ratio onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
585cb0c718 repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
d8f7f86e10 repair: Move enable_small_table_optimization_for_rbno onto repair_service::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
38a23ff927 repair: Introduce repair_service::config
Most other services have their configs, rpair still uses global
db::config.

Add an empty config struct to repair_service to carry db::config options
the repair service needs.

Subsequent patches will populate the struct with options.

The config is created in main.cc as sharded_parameter because all future
options are live-updateable and should capture theirs source from
db::config on correct shard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-20 19:23:47 +03:00
Pavel Emelyanov
7dce43363e table: merge adjacent querier_opt checks in query()
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.

No functional change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 14:48:08 +03:00
Piotr Dulikowski
cc695bc3f7 Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832

Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.

Closes scylladb/scylladb#29031

* github.com:scylladb/scylladb:
  vector_search: test: fix flaky test
  vector_search: fix race condition on connection timeout
2026-03-20 11:12:04 +01:00
Petr Gusev
4bfcd035ae test_fencing: add missing await-s
Fixes SCYLLADB-1099

Closes scylladb/scylladb#29133
2026-03-20 10:55:35 +01:00
Pavel Emelyanov
9c1c41df03 table: don't close a disengaged querier in query()
The condition guarding querier_opt->close() was:

When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged.  If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created.  Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.

Fix by checking querier_opt first:

This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.

Why this doesn't surface today: the sole production call site, database::query(),
in practice.  The API header documents nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 12:25:13 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.

The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

Minus one scan_dir user twards scan_dir removal from code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Botond Dénes
bb5c328a16 Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split each into two, so we have four tests, and now we have two that can also be squashed, the lines-of-code savings still worth it.

This is the continuation of #28569

Tests improvement, not backporting

Closes scylladb/scylladb#28994

* github.com:scylladb/scylladb:
  test: Replace a bunch of ternary operators with an if-else block
  test: Squash test_restore_primary_replica_same|different_domain tests
  test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
2026-03-20 10:05:16 +02:00
Pavel Emelyanov
ea2a214959 test/backup: Use unique_name() for backup prefix instead of cf_dir
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:

    prefix = f'{cf_dir}/backup'

The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29030
2026-03-20 10:04:22 +02:00
Pavel Emelyanov
65032877d4 api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.

This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28996
2026-03-20 09:52:33 +02:00
Botond Dénes
de0bdf1a65 Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov
The test in question uses several helpers from the backup sute, but it doesn't really need them -- the operations it want to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from)

Enhancing tests, not backporting

Closes scylladb/scylladb#29130

* github.com:scylladb/scylladb:
  test/refresh: Simplify refresh invocation
  test/refresh: Remove r_servers alias for servers
  test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
  test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
  test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
  test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
  test/refresh: Remove unused wait_for_cql_and_get_hosts import
2026-03-20 09:29:15 +02:00
Botond Dénes
97430e2df5 Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access

Fixes #29058

The code appeared in 2026.1, worth fixing it there as well

Closes scylladb/scylladb#29059

* github.com:scylladb/scylladb:
  sstables: Fix object storage lister not resetting position in batch vector
  sstables: Fix object storage lister skipping entries when filter is active
2026-03-20 09:12:42 +02:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Botond Dénes
34473302b0 Merge 'docs: document existing guardrails' from Andrzej Jackowski
This patch series introduces a new documentation for exiting guardrails.

Moreover:
 - Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
 - Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.

Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)

No backport, just new docs and tests.

[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29011

* github.com:scylladb/scylladb:
  test: add new guardrail tests matching documentation scenarios
  test: add metric assertions to guardrail replication strategy tests
  test: use regex matching in guardrail replication strategy tests
  test: extract ks_opts helper in test_guardrail_replication_strategy
  docs: document CQL guardrails
  cql: improve write consistency level guardrail messages
2026-03-20 08:56:00 +02:00
artem.penner
9898e5700b scylla-node-exporter: Add systemd collector to node exporter
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.

**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.

This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.

example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`

Closes #28402
2026-03-20 08:39:56 +02:00
Andrzej Jackowski
10c4b9b5b0 test: verify signal() detects resource negative leak in rcs
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.

Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.

Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031

Closes scylladb/scylladb#28993
2026-03-20 09:21:20 +03:00
Botond Dénes
f9adbc7548 test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakyness.

Fixes: SCYLLADB-1062

Closes scylladb/scylladb#29077
2026-03-20 09:14:29 +03:00
Michał Chojnowski
6b18d95dec test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py
Need to work around https://github.com/scylladb/python-driver/issues/295,
lest a CQL query fail spuriously after the cluster restart.

Fixes: SCYLLADB-1114

Closes scylladb/scylladb#29118
2026-03-20 09:05:14 +03:00
Botond Dénes
89388510a0 test/cluster/test_data_resurrection_in_memtable.py: use explicit CL
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node

However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.

Use explicit CL for the ingestion to prevent this.

The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.

Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102

Closes scylladb/scylladb#29128
2026-03-20 09:02:57 +03:00
Avi Kivity
6b259babeb Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.

Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.

Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.

to use, add to config:
```
enable_logstor: true

experimental_features:
  - logstor
```

create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```

INSERT, SELECT, DELETE work as expected
UPDATE not supported yet

no backport - new feature

Closes scylladb/scylladb#28706

* github.com:scylladb/scylladb:
  logstor: trigger separator flush for buffers that hold old segments
  docs/dev: add logstor documentation
  logstor: recover segments into compaction groups
  logstor: range read
  logstor: change index to btree by token per table
  logstor: move segments to replica::compaction_group
  db: update dirty mem limits dynamically
  logstor: track memory usage
  logstor: logstor stats api
  logstor: compaction buffer pool
  logstor: separator: flush buffer when full
  logstor: hold segment until index updates
  logstor: truncate table
  logstor: enable/disable compaction per table
  logstor: separator buffer pool
  test: logstor: add separator and compaction tests
  logstor: segment and separator barrier
  logstor: separator debt controller
  logstor: compaction controller
  logstor: recovery: recover mixed segments using separator
  logstor: wait for pending reads in compaction
  logstor: separator
  logstor: compaction groups
  logstor: cache files for read
  logstor: recovery: initial
  logstor: add segment generation
  logstor: reserve segments for compaction
  logstor: index: buckets
  logstor: add buffer header
  logstor: add group_id
  logstor: record generation
  logstor: generation utility
  logstor: use RIPEMD-160 for index key
  test: add test_logstor.py
  api: add logstor compaction trigger endpoint
  replica: add logstor to db
  schema: add logstor cf property
  logstor: initial commit
  db: disable tablet balancing with logstor
  db: add logstor experimental feature flag
2026-03-20 00:18:09 +02:00
Avi Kivity
062751fcec Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.

New functionality. No backport needed.

This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's  https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.

Closes scylladb/scylladb#28960

* github.com:scylladb/scylladb:
  db/config: announce ms format as highest supported
  db/config: enable `ms` sstable format by default
  cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
  api/system: add /system/chosen_sstable_version
  test/cluster/dtest: reduce num_tokens to 16
2026-03-19 18:19:01 +02:00
Pavel Emelyanov
969dddb630 test/refresh: Simplify refresh invocation
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:57 +03:00
Pavel Emelyanov
de21572b31 test/refresh: Remove r_servers alias for servers
r_servers = servers was a no-op assignment; use servers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:52 +03:00
Pavel Emelyanov
20b1531e6d test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-19 18:42:48 +03:00
Pavel Emelyanov
c591b9ebe2 test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:44 +03:00
Pavel Emelyanov
06006a6328 test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:40 +03:00
Pavel Emelyanov
67d8cde42d test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:36 +03:00
Pavel Emelyanov
04f046d2d8 test/refresh: Remove unused wait_for_cql_and_get_hosts import
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:32 +03:00
Artsiom Mishuta
0ede308a04 test/pylib: save logs on success only during teardown phase
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown)in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.
2026-03-19 16:35:22 +01:00
Artsiom Mishuta
cbc07569c0 test: Lower default log level from DEBUG to INFO
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
2026-03-19 16:32:30 +01:00
Botond Dénes
e8b37d1a89 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions
2026-03-19 17:13:53 +02:00
Dario Mirovic
d2c44722e1 test: cluster: fix log clear race condition in test_audit.py
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"

If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.

Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
  after server_update_config
- Draining pending syslog lines before clearing the buffer

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
d94999f87b test: cluster: fix multinode audit entry comparison in test_audit.py
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.

Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
249a6cec1b test: cluster: dtest: remove old audit tests
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
adc790a8bf test: cluster: group migrated audit tests for cluster reuse
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.

Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s

Fixes SCYLLADB-573
2026-03-19 16:11:47 +01:00
Dario Mirovic
967b7ff6bf test: cluster: enable migrated audit tests and make them work
Make audit tests from test/cluster/dtest to test/cluster.
test/cluster environment has less overhead, and audit tests
are heavy, their execution taking lots of time. This patch
is part of an effort to improve audit test suite performance.

This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.

All the tests are confirmed to be running after the change.
No dead code present.

Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.

Refs SCYLLADB-573
2026-03-19 16:07:28 +01:00
Dario Mirovic
5d51501a0b pgo: use maintenance socket for CQL setup in PGO training
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.

Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, no credentials
are needed. Additionally, create the 'cassandra' superuser via the
maintenance socket during the populate phase, so that cassandra-stress
keeps working. cassandra-stress hardcodes user=cassandra password=cassandra.

Changes:
- exec_cql.py: replace host/port/username/password arguments with a
  single --socket argument; add connect_maintenance_socket() with
  wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
  populate_auth_conns() and populate_counters() to pass the socket
  path to exec_cql.py

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29081
2026-03-19 16:52:36 +02:00
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
0a7a69345c test: pylib: scylla cluster after_test log fix
Before any test, a pool of ScyllaCluster objects is created.

At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.

Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
  of that test suite
- A ScyllaCluster object is fetched from the pool
  - If the pool is empty, a new ScyllaCluster object is created
  - If the pool is not empty, a cached ScyllaCluster object is returned

After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
  - If the cluster is dirty, the pool destroys it
  - If the cluster is clean, the pool caches it
- ManagerClient is destroyed

Many actions mark a cluster as dirty. Normal test execution will always
make the cluster be destroyed upon returning to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the flow with cluster reuse happens.

The bug is that the log file is closed even if cluster is not dirty.
This causes an error when trying to log to a reused cluster server.

The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.

Another approach would be to reopen the log file if closed, but this
approach seems more clean.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
899ae71349 test: audit: copy audit test from dtest
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Andrzej Jackowski
4deeb7ebfc test: add new guardrail tests matching documentation scenarios
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.

Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
2a03c634c0 test: add metric assertions to guardrail replication strategy tests
Verify that guardrail violations increment the corresponding metrics.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
81c4e717e2 test: use regex matching in guardrail replication strategy tests
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Anna Stuchlik
6b1df5202c doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103
2026-03-19 15:47:00 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amoritze partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Botond Dénes
4981e72607 Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.

The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.

This series fixes that by:

1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.

Improvements. No backport is required.

Closes scylladb/scylladb#28963

* github.com:scylladb/scylladb:
  replica/table: avoid computing token range side in storage_group_of() on hot path
  replica/compaction_group: add lazy select_compaction_group() overloads
  locator/tablets: add tablet_map::get_tablet_range_side()
2026-03-19 14:27:12 +02:00
Ernest Zaslavsky
aa9da87e97 encryption: fix deadlock in encrypted_data_source::get()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
f74a54f005 test_lib: mark limiting_data_source_impl as not final 2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
151e945d9f Fix formatting after previous patch 2026-03-19 13:54:44 +02:00
Andrzej Jackowski
517bb8655d test: extract ks_opts helper in test_guardrail_replication_strategy
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.

This facilitates further modification of the tests later in this
patch series.

Refs: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Andrzej Jackowski
9b24d9ee7d docs: document CQL guardrails
Add docs/cql/guardrails.rst covering replication factor, replication
strategy, write consistency level, and compact storage guardrails.

Fixes: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Ernest Zaslavsky
537747cf5d Fix indentation after previous patch 2026-03-19 13:48:53 +02:00
Ernest Zaslavsky
2535164542 test_lib: make limiting_data_source_impl available to tests
Relocate the `limiting_data_source_impl` declaration to the header file
so that test code can access it directly.
2026-03-19 13:48:53 +02:00
Botond Dénes
86d7c82993 test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.

Fixes: SCYLLADB-940

Closes scylladb/scylladb#29071
2026-03-19 12:42:18 +03:00
Michael Litvak
399260a6c0 test: mv: fix flaky wait for commitlog sync
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and used commitlog periodic mode to persist the
view build progress in a not very reliable way.

It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.

To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.

Fixes SCYLLADB-1005

Closes scylladb/scylladb#29008
2026-03-19 10:41:21 +01:00
Pavel Emelyanov
f27dc12b7c Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close
2026-03-19 12:40:23 +03:00
Raphael S. Carvalho
3143134968 test: avoid split/major compaction deadlock in tablet split test
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.

The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.

At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.

Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29066
2026-03-19 11:12:21 +02:00
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
Piotr Smaron
a2ad57062f docs/cql: clarify WHERE clause boolean limitations
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem has been raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299,
and while one could infer answer to user's question by looking at the
syntax of the `SELECT ... WHERE`, it's not immediately obvious to
non-advanced users, so clarifying these concepts is justified.

Fixes: SCYLLADB-1116

Closes scylladb/scylladb#29100
2026-03-19 09:47:22 +02:00
Michael Litvak
31d339e54a logstor: trigger separator flush for buffers that hold old segments
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.

Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.

To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
2026-03-18 19:24:28 +01:00
Michael Litvak
ad87eda835 docs/dev: add logstor documentation 2026-03-18 19:24:28 +01:00
Michael Litvak
a0da07e5b7 logstor: recover segments into compaction groups
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
2026-03-18 19:24:28 +01:00
Michael Litvak
24379acc76 logstor: range read
extend the logstor mutation reader to support range read
2026-03-18 19:24:28 +01:00
Michael Litvak
a9d0211a64 logstor: change index to btree by token per table
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
2026-03-18 19:24:28 +01:00
Michael Litvak
e7c3942d43 logstor: move segments to replica::compaction_group
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.

When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
2026-03-18 19:24:28 +01:00
Michael Litvak
d69f7eb0ee db: update dirty mem limits dynamically
when logstor is enabled, update the db dirty memory limits dynamically.

previously the threshold is set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).

when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
2026-03-18 19:24:27 +01:00
Michael Litvak
65cd0b5639 logstor: track memory usage
add logstor::get_memory_usage() that returns an estimate of the memory
usage by logstor. add tracking to how many unique keys are held in the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
b7bdb1010a logstor: logstor stats api
add api to get logstor statistics about segments for a table
2026-03-18 19:24:27 +01:00
Michael Litvak
8bd3bd7e2a logstor: compaction buffer pool
pre-allocate write buffers for compaction
2026-03-18 19:24:27 +01:00
Michael Litvak
caf5aa47c2 logstor: separator: flush buffer when full
flush separator buffers when they become full and switched instead of
aggregating all the buffers and flushing them when the separator is
switched.
2026-03-18 19:24:27 +01:00
Michael Litvak
6ddb7a4d13 logstor: hold segment until index updates
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.

in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.

when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.

this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
2026-03-18 19:24:27 +01:00
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
489efca47c logstor: enable/disable compaction per table
add functions to enable or disable compaction for a specific compaction
group or for all compaction groups of a table.
2026-03-18 19:24:27 +01:00
Michael Litvak
21db4f3ed8 logstor: separator buffer pool
pre-allocate write buffers for the separator
2026-03-18 19:24:27 +01:00
Michael Litvak
37c485e3d1 test: logstor: add separator and compaction tests 2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
1231fafb46 logstor: separator debt controller
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
17cb173e18 logstor: compaction controller
adjust compaction shares by the compaction overhead: how many segments
compaction writes to generate a single free segment for new writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
1da1bb9d99 logstor: recovery: recover mixed segments using separator
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
2026-03-18 19:24:27 +01:00
Michael Litvak
b78cc787a6 logstor: wait for pending reads in compaction
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure they
are no running operations that use the old locations before we free the
segment.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
2026-03-18 19:24:27 +01:00
Michael Litvak
009fc3757a logstor: compaction groups
divide the segments in the compaction manager to compaction group.
compaction will compact only segments from a single compaction group at
a time.
2026-03-18 19:24:27 +01:00
Michael Litvak
b3293f8579 logstor: cache files for read
keep all files for all segments open for read to improve reads.
2026-03-18 19:24:26 +01:00
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
Michael Litvak
bc9fc96579 logstor: add segment generation
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
2026-03-18 19:24:26 +01:00
Michael Litvak
719f7cca57 logstor: reserve segments for compaction
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.

do the compaction writes into full new segments instead of the active
segment.
2026-03-18 19:24:26 +01:00
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index to buckets, each bucket containing a btree. the
bucket is determined by using bits from the key hash.
2026-03-18 19:24:26 +01:00
Michael Litvak
99c3b1998a logstor: add buffer header
add a buffer header in each write buffer we write that contains some
information that can be useful for recovery and reading.
2026-03-18 19:24:26 +01:00
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
08bea860ef logstor: record generation
add a record generation number for each record so we can compare
records and find which one is newer.
2026-03-18 19:24:26 +01:00
Michael Litvak
28f820eb1c logstor: generation utility
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraparounds, assuming the values we compare were
written around the same time.
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
a521bcbcee test: add test_logstor.py
add basic tests for key-value tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
1ae1f37ec1 api: add logstor compaction trigger endpoint
add a new api endpoint that triggers logstor compaction.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
and reading to tables with kv storage
2026-03-18 19:24:26 +01:00
Michael Litvak
9172cc172e schema: add logstor cf property
add a schema property for tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segment with low space usage and writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00
Michael Litvak
27fd0c119f db: disable tablet balancing with logstor
initially logstor tables will not support tablet migrations, so
disable tablet balancing if the experimental feature flag is set.
2026-03-18 19:24:26 +01:00
Michael Litvak
ed852a2af2 db: add logstor experimental feature flag
add a new experimental feature flag for key-value tables with the new
logstor storage engine.
2026-03-18 19:24:26 +01:00
Anna Stuchlik
88b98fac3a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111
2026-03-18 19:35:18 +02:00
Avi Kivity
46a6f8e1d3 Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.

This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.

Refs SCYLLADB-1070

This is an improvement, no need for backport.

Closes scylladb/scylladb#29080

* github.com:scylladb/scylladb:
  test: use NetworkTopologyStrategy in maintenance socket tests
  test: use cleanup fixture in maintenance socket auth tests
  auth: add maintenance_socket_authorizer
2026-03-18 19:29:57 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Gleb Natapov
2d8b3e751b view: drop unused v1 builder code 2026-03-18 17:45:40 +02:00
Gleb Natapov
77d3245e02 view: remove upgrade to raft code
Since we do no longer support upgrade from versions that do not support
v2 of view building code we can remove upgrade code and make sure we do
not boot with old builder version.
2026-03-18 17:45:40 +02:00
Tomasz Grabiec
4410e9c61a sstables: mx: index_reader: Optimize parsing for no promoted index case
It's a common case with small partition workloads.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
32f8609b89 vint: Use std::countl_zero()
It handles 0, and could generate better code for that. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.

Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.

Move to header so that it's inlined in parsers.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
6017688445 test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement 2026-03-18 16:25:21 +01:00
Tomasz Grabiec
f55bb154ec sstables: mx: index_reader: Amoritze partition key storage
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:

 - index entries don't store keys as separate small objects (managed_bytes)
   They are written into one managed_bytes fragmented storage, entries
   hold offset into it.

   Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
   the storage (1 byte) plus back-reference in the storage (8 bytes),
   so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
   bytes, that's a reduction from 31 bytes to 20 bytes per key.

 - index entries and key storage are now trivially moveable, so LSA
   migration can use memcpy() which amortizes the cost per key.
   memcpy().

   LSA eviction is now trivial and constant time for the whole page
   regardless of the number of entries. Page eviction dropped from
   14 us to 1 us.

This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

    15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
    15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
    15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
    15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
    15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
    15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
    15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)

After:

    21620.18 tps (148.4 allocs/op,  13.4 logallocs/op,  43.7 tasks/op,  176817 insns/op,  153183 cycles/op,        0 errors)
    20644.03 tps (149.8 allocs/op,  13.5 logallocs/op,  44.3 tasks/op,  187941 insns/op,  160409 cycles/op,        0 errors)
    20588.06 tps (150.1 allocs/op,  13.5 logallocs/op,  44.5 tasks/op,  188090 insns/op,  160818 cycles/op,        0 errors)
    20789.29 tps (149.5 allocs/op,  13.5 logallocs/op,  44.2 tasks/op,  186495 insns/op,  159382 cycles/op,        0 errors)
    20977.89 tps (149.5 allocs/op,  13.4 logallocs/op,  44.2 tasks/op,  183969 insns/op,  158140 cycles/op,        0 errors)
    21125.34 tps (149.1 allocs/op,  13.4 logallocs/op,  44.1 tasks/op,  183204 insns/op,  156925 cycles/op,        0 errors)
    21244.42 tps (148.6 allocs/op,  13.4 logallocs/op,  43.8 tasks/op,  181276 insns/op,  155973 cycles/op,        0 errors)

Mostly because the index now fits in memory.

When it doesn't, the benefits are still visible due to lower LSA overhead.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
50dc7c6dd8 sstables: mx: index_reader: Keep promoted_index info next to index_entry
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.

For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.

promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.

Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.

This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

9714.78 tps (170.9 allocs/op,  16.9 logallocs/op,  55.3 tasks/op,  494788 insns/op,  343920 cycles/op,        0 errors)
9603.13 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  502358 insns/op,  348344 cycles/op,        0 errors)
9621.43 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  500612 insns/op,  347508 cycles/op,        0 errors)
9597.75 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  501428 insns/op,  348604 cycles/op,        0 errors)
9615.54 tps (171.6 allocs/op,  16.9 logallocs/op,  55.6 tasks/op,  501313 insns/op,  347935 cycles/op,        0 errors)
9577.03 tps (171.8 allocs/op,  17.0 logallocs/op,  55.7 tasks/op,  503283 insns/op,  349251 cycles/op,        0 errors)

After:

15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5e228a8387 sstables: mx: index_reader: Extract partition_index_page::clear_gently()
There will be more elements to clear. And partition_index_page should
know how to clear itself.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
2d77e4fc28 sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
The std::optional<> adds 8 bytes.

And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.

The size changd from 56 to 48 bytes.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
e9c98274b5 sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
If the page has many entries, we continuously enter and leave the
allocating section for every key. This can be avoided by batching LSA
operations for the whole page, after collecting all the entries.

Later optimizations will also build on this, where we will allocate
fragmented storage for keys in LSA using a single managed_bytes
constructor.

This alone brings only a minor improvement, but it does reduce LSA
allocations, probably due to less frequent memory reclamation:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

After:

  9562.21 tps (172.0 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  499225 insns/op,  347832 cycles/op,        0 errors)
  9480.20 tps (172.3 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  507271 insns/op,  350640 cycles/op,        0 errors)
  9512.42 tps (172.1 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  504247 insns/op,  350392 cycles/op,        0 errors)
  9498.45 tps (172.4 allocs/op,  17.1 logallocs/op,  55.9 tasks/op,  505765 insns/op,  350320 cycles/op,        0 errors)
  9076.30 tps (173.5 allocs/op,  17.1 logallocs/op,  56.5 tasks/op,  512791 insns/op,  354792 cycles/op,        0 errors)
  9542.62 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  502532 insns/op,  348922 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
0e0f9f41b3 sstables: mx: index_reader: Keep index_entry directly in the vector
Partition index entries are relatively small, and if the workload has
small partitions, index pages have a lot of elements. Currently, index
entries are indirected via managed_ref, which causes increased cost of
LSA eviction and compaction. This patch amortizes this cost by storing
them dierctly in the managed_chunked_vector.

This gives about 23% improvement in throughput in perf-simple-query
for a workload where the index doesn't fit in memory:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
  7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
  7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
  7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
  7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)

After:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

Disabling the partition index doesn't improve the throuhgput beyond
that.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
b6bfdeb111 dht: Introduce raw_token
Most tokens stored in data structures are for key-scoped tokens, and
we don't need to pay for token::kind storage.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
3775593e53 test: perf_simple_query: Add 'sstable-format' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
6ee9bc63eb test: perf_simple_query: Add 'sstable-summary-ratio' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
38d130d9d0 test: perf-simple-query: Add option to disable index cache 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5ee61f067d test: cql_test_env: Respect enable-index-cache config
Mirrors the code in main.cc
2026-03-18 16:25:20 +01:00
Aleksandra Martyniuk
2d16083ba6 tasks: fix indentation 2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
1fbf3a4ba1 tasks: do not fail the wait request if rpc fails
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.
2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
d4fdeb4839 tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).

Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.

Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
2026-03-18 15:37:24 +01:00
Calle Wilund
0013f22374 memtable_test::memtable_flush_period: Change sleep to use injection signal instead
Fixes: SCYLLADB-942

Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.

Closes scylladb/scylladb#29070
2026-03-18 16:23:13 +02:00
Botond Dénes
ae17596c2a Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.

Only 2026.1 is affected.

Closes scylladb/scylladb#29032

* github.com:scylladb/scylladb:
  replica: Demote log level on split failure during shutdown
  service: Demote log level on split failure during shutdown
2026-03-18 16:21:05 +02:00
Pavel Emelyanov
8b1ca6dcd6 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050
2026-03-18 13:50:48 +01:00
Pavel Emelyanov
d68c92ec04 test: Replace a bunch of ternary operators with an if-else block
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
b1d4fc5e6e test: Squash test_restore_primary_replica_same|different_domain tests
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.

The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
21c603a79e test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The one in "different domain" test is simpler because the test performs
less checks. Next patch will merge both tests and making regexp-s look
identical makes the merge even smother.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:07:09 +03:00
Emil Maskovsky
34f3916e7d .github: update test instructions for unified pytest runner
Update test running instructions to reflect unified pytest-based runner.

The test.py now requires full test paths with file extensions for both
C++ and Python tests.

No backport: The change is only relevant for recent test.py changes in
master.

Closes scylladb/scylladb#29062
2026-03-18 09:28:28 +01:00
Marcin Maliszkiewicz
04bf631d7f auth: switch query_attribute_for_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
cf578fd81a auth: switch get_attribute to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
06d16b6ea2 auth: cache: add heterogeneous map lookups
Some callers have only string_view role name,
they shouldn't need to allocate sstring to do the
lookup.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
7fdb1118f5 auth: switch query_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
fca11c5a21 auth: switch query_all_directly_granted to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
6f682f7eb1 auth: cache: add ability to go over all roles
This is needed to implement auth service api where
we list all roles.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
61952cd985 raft: service: reload auth cache before service levels
Since service levels depend on auth data, and not other
way around, we need to ensure a proper loading order.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
c4cfb278bc service: raft: move update_service_levels_effective_cache check
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.

While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
2026-03-18 09:06:20 +01:00
Benny Halevy
c2a6d1e930 test/boost/database_test: add test_snapshot_ctl_details_exception_handling
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:37:44 +02:00
Benny Halevy
6dc4ea766b table: get_snapshot_details: fix indentation inside try block
Whitespace-only change: indent the loop body one level inside the
try block added in the previous commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:28:50 +02:00
Benny Halevy
b09d45b89a table: per-snapshot get_snapshot_details: fix typo in comment
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.

Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:40 +02:00
Benny Halevy
580cc309d2 table: per-snapshot get_snapshot_details: always close lister using try/catch
Since this is a coroutine, we cannot just use deferred_close,
but rather we need to catch an error, close the lister, and then
return the error, is applicable.

Fixes: SCYLLADB-1013

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:23 +02:00
Benny Halevy
78c817f71e table: get_snapshot_details: always close lister using deferred_close
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:26:26 +02:00
Dario Mirovic
71e6918f28 test: use NetworkTopologyStrategy in maintenance socket tests
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.

Refs SCYLLADB-1070
2026-03-17 20:20:47 +01:00
Dario Mirovic
278535e4e3 test: use cleanup fixture in maintenance socket auth tests
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.

Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.

Refs SCYLLADB-1070
2026-03-17 20:15:30 +01:00
Dario Mirovic
2e4b72c6b9 auth: add maintenance_socket_authorizer
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.

This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.

Refs SCYLLADB-1070
2026-03-17 19:19:41 +01:00
Botond Dénes
172c786079 Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload
2026-03-17 18:38:11 +02:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Anna Stuchlik
f4a6bb1885 doc: remove the Open Source Example from Installation
This commit replaces the Open Soruce example from the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.

Fixes https://github.com/scylladb/scylladb/issues/29087
2026-03-17 14:54:32 +01:00
Anna Stuchlik
95bc8911dd doc: replace http with https in the installation instructions
Fixes https://github.com/scylladb/scylladb/issues/17227
2026-03-17 14:46:16 +01:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
318aa07158 Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.

 This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before  the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.

The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".

After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.

Closes scylladb/scylladb#28972

* github.com:scylladb/scylladb:
  test/alternator: speed up test_streams.py by using module-scope fixtures
  test/alternator: test_streams.py don't use fixtures in 4 tests
  test/alternator: fix do_test() in test_streams.py
2026-03-17 13:56:16 +02:00
Ernest Zaslavsky
7f597aca67 cmake: fix broken build
Add raft_util.idl.hh to cmake to generate the code properly

Closes scylladb/scylladb#29055
2026-03-17 10:35:34 +01:00
Botond Dénes
dbe70cddca test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry, it doesn't add anything
to the test.

Fixes: SCYLLADB-811

Closes scylladb/scylladb#28958
2026-03-17 10:32:23 +01:00
Botond Dénes
0fd51c4adb test/nodetool: rest_api_mock_server: add retry for status code 404
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.

Fixes: SCYLLADB-966

Closes scylladb/scylladb#29003
2026-03-17 10:30:23 +01:00
Asias He
6cb263bab0 repair: Prevent CPU stall during cross-shard row copy and destruction
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.

This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.

Fixes SCYLLADB-403

Closes scylladb/scylladb#28979
2026-03-17 11:05:15 +02:00
Pavel Emelyanov
9fe19ec9d9 sstables: Fix object storage lister not resetting position in batch
vector

The lister loop in get() pre-fetches records in batches and keeps them
in a _info vector, iterating over it with the help of _pos cursor. When
the vector is re-read, the cursor must be reset too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:42 +03:00
Pavel Emelyanov
1a6a7647c6 sstables: Fix object storage lister skipping entries when filter is active
The lister loop in get() method looks weird. It uses do-while(false)
loop and calls continue; inside when filter asks to skip a entry.
Skipping, thus, aborts the whole thing and EOF-s, which is not what's
supposed to happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:40 +03:00
Botond Dénes
035aa90d4b Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman
This series fixes a metrics visibility gap in Alternator and adds regression coverage.

Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.

It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.

Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.

This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.

Fixes #28721

**We assume the alternator per-table metrics exist, but the batch ones are not updated**

Closes scylladb/scylladb#28732

* github.com:scylladb/scylladb:
  test(alternator): add per-table latency coverage for item and batch ops
  alternator: track per-table latency for batch get/write operations
2026-03-16 17:18:00 +02:00
Michał Hudobski
40d180a7ef docs: update vector search filtering to reflect primary key support only
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.

Closes scylladb/scylladb#28949
2026-03-16 17:16:16 +02:00
Botond Dénes
9de8d6798e Merge 'reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory' from Łukasz Paszkowski
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016

Backport not needed. The commit introducing replica load shedding is not part of 2026.1

Closes scylladb/scylladb#29025

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
  reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
2026-03-16 17:14:25 +02:00
Marcin Maliszkiewicz
9318c80203 perf: add abort_source support to wait-for-port loops
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.

Didn't use sleep_abortable as the sleep is very short
anyway.
2026-03-16 16:14:10 +01:00
Calle Wilund
a5df2e79a7 storage_service: Wait for snapshot/backup before decommission
Fixes: SCYLLADB-244

Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).

Could do the snapshot lockout on API level, but want to do
pre-checks before this.

Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.

v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts

Closes scylladb/scylladb#28980
2026-03-16 17:12:57 +02:00
Marcin Maliszkiewicz
edf0148bee perf-alternator: wait for alternator port before running workload
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
2026-03-16 16:07:52 +01:00
bitpathfinder
85d5073234 test: Fix non-awaited coroutine in test_gossiper_empty_self_id_on_shadow_round
The line with the error was not actually needed and has therefore been removed.

Fixes: SCYLLADB-906

Closes scylladb/scylladb#28884
2026-03-16 17:07:36 +02:00
Botond Dénes
3e4e0c57b8 Merge 'Relax rf-rack-valid-keyspace option in backup/restore tests' from Pavel Emelyanov
Some tests, when create a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that the option is explicitly carried around, but the cluster creating helper can guess this option itself -- out of the provided topology and replication factor.

Removing this option simplifies the code and (which a nicer outcome) the test "signature" that's used e.g. in command-line to run a specific test.

Improving tests, not backporting

Closes scylladb/scylladb#28860

* github.com:scylladb/scylladb:
  test: Relax topology_rf_validity parameter for some tests
  test: Auto detect rf-rack-valid option in create_cluster()
2026-03-16 17:06:46 +02:00
Raphael S. Carvalho
ee87b66033 replica: Demote log level on split failure during shutdown
Dtest failed with:

table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...

The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.

Fixes https://github.com/scylladb/scylladb/issues/24850.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 12:03:17 -03:00
Patryk Jędrzejczak
526e5986fe test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998
2026-03-16 16:58:15 +02:00
Raphael S. Carvalho
b508f3dd38 service: Demote log level on split failure during shutdown
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 11:52:00 -03:00
Dani Tweig
bc0952781a Update Jira sync calling workflow to consolidated view
Replaced multiple per-action workflow jobs with a single consolidated
call to main_pr_events_jira_sync.yml. Added 'edited' event trigger.
This makes CI actions in PRs more readable and workflow execution faster.

Fixes:PM-253

Closes scylladb/scylladb#29042
2026-03-16 08:25:32 +02:00
Artsiom Mishuta
755d528135 test.py: fix warnings
changes in this commit:
1)rename class from 'TestContext' to  'Context' so pytest will not consider this class as a test

2)extend pytest filterwarnings list to ignore warnings from external libs

3) use datetime.datetime.now(datetime.UTC) unstead  datetime.datetime.utcnow()

4) use  ResultSet.one() instead  ResultSet[0]

Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902

Closes scylladb/scylladb#28956
2026-03-15 12:00:10 +02:00
Karol Nowacki
7659a5b878 vector_search: test: fix flaky test
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).

This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
2026-03-13 16:28:22 +01:00
Karol Nowacki
5474cc6cc2 vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832
2026-03-13 16:28:22 +01:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)

For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Andrzej Jackowski
60aaea8547 cql: improve write consistency level guardrail messages
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.

Refs: SCYLLADB-257
2026-03-13 14:40:45 +01:00
Tomasz Grabiec
518470e89e Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync
2026-03-13 13:08:11 +01:00
Pavel Emelyanov
d544d8602d test: Relax topology_rf_validity parameter for some tests
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Pavel Emelyanov
313985fed7 test: Auto detect rf-rack-valid option in create_cluster()
The helper accepts its as boolean argument, but it can easily estimate
one from the provided topology.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Gleb Natapov
fae5282c82 service level: fix crash during migration to driver server level
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049

Closes scylladb/scylladb#29021
2026-03-13 11:24:26 +01:00
Łukasz Paszkowski
4c4d043a3b reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.

Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
  no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
  adjusted to the fix

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
2026-03-13 09:50:05 +01:00
Dani Tweig
aa46a0f4e0 Add VECTOR to the list of synced milestones in scylladb.git
- Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`.
- The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`.

- The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project.
- Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced.

Fixes:PM-220

Closes scylladb/scylladb#29014
2026-03-13 09:58:41 +02:00
Botond Dénes
fc8cebd671 Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode

Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.

Depends on https://github.com/scylladb/scylladb/pull/28338

Backport is not required, this is new feature

Fixes https://github.com/scylladb/scylladb/issues/20103

Closes scylladb/scylladb#28761

* github.com:scylladb/scylladb:
  test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
  docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
  sstables: propagate ignore_component_digest_mismatch config to all load sites
  sstables: add option to ignore component digest mismatches
  sstable_compaction_test: Add scrub validate test for corrupted index
  sstables: add tests for component digest validation on corrupted SSTables
  sstables: validate index components digests during SSTable scrub in validate mode
  sstables: verify component digests on SSTable load
  sstables: add digest_file_random_access_reader for CRC32 digest computation
2026-03-13 09:55:55 +02:00
Avi Kivity
ae8a418744 Merge 'Await async calls in test tablets migration' from Benny Halevy
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming

Fixes SCYLLADB-910

* Minor fixes, no backport needed

Closes scylladb/scylladb#28908

* github.com:scylladb/scylladb:
  test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
  test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
  test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
  test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
  test_tablets_migration: drop unused imports from cassandra.query
2026-03-13 00:20:29 +02:00
Avi Kivity
b228eb26e6 Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund
Fixes #25084

Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.

Note: this contains and updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script.

Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).

Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than requires for our sys notify messages.

Closes scylladb/scylladb#28727

* github.com:scylladb/scylladb:
  gcs_fixture: Change to use docker helper
  aws_kms_fixture: Modify to use docker helper
  test/lib/proc_util: Add docker helper
  pytest: use ephemeral port publish for docker mock servers
  dbuild: Use container network in dbuild nested containers
  scylla_cluster: Read notify sock in background to prevent deadlock
2026-03-12 23:49:25 +02:00
Nadav Har'El
ad832c263e test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail
A few days ago, in commit 7b30a39 we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an xfail
test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server

This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29016
2026-03-12 23:46:23 +02:00
Tomasz Grabiec
1256a9faa7 tablets: Fix deadlock in background storage group merge fiber
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.

Two cluster tests were removed because they assumed that merge happens
in the backgournd. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.

The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.

Fixes SCYLLADB-928
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
7706c9e8c4 replica: table: Propagate old erm to storage group merge 2026-03-12 22:45:01 +01:00
Tomasz Grabiec
582a4abeb6 test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:

  Found 0 split ready compaction groups, but expected 2 instead.

Exposed by increased split activity in tests.
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
279fcdd5ff storage_service: Extract local_topology_barrier()
Will be called in tests. It does the local part of the global topology
barrier.

The comment:

        // We capture the topology version right after the checks
        // above, before any yields. This is crucial since _topology_state_machine._topology
        // might be altered concurrently while this method is running,
        // which can cause the fence command to apply an invalid fence version.

was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
2026-03-12 22:44:56 +01:00
Avi Kivity
03186ce60d Merge 'Cleanup after auth v1 and default superuser code removal' from Marcin Maliszkiewicz
This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036

Backport: no, just code cleanup

Closes scylladb/scylladb#29004

* github.com:scylladb/scylladb:
  auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
  auth: use configurable default_superuser in describe_roles
  auth: move default_superuser to common, remove _superuser member
  auth: use LOCAL_ONE for all auth queries
  auth: remove get_auth_ks_name indirection
2026-03-12 23:44:32 +02:00
Avi Kivity
e2eeef3e01 Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.

No need to backport, code removal,

Closes scylladb/scylladb#29002

* github.com:scylladb/scylladb:
  service level: make maybe_update_per_service_level_params synchronous
  service level: remove unused get_user_scheduling_group function
  service level: drop async find_effective_service_level
  service level: remove remnants of version 1 service level
2026-03-12 23:39:41 +02:00
Botond Dénes
eed3a6d407 sstables/mx/writer: move post-cell write yield to collection write loop
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.

Closes scylladb/scylladb#29013
2026-03-12 21:26:35 +02:00
Avi Kivity
e8a6706d6e Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918

No backport: performance improvements mostly relevant to tests.

Closes scylladb/scylladb#29020

* github.com:scylladb/scylladb:
  test: pylib: util: wait for CQL being ready with a shorter period
  group0: discovery: shorten the pause duration
2026-03-12 21:17:05 +02:00
Wojciech Mitros
32974770b0 test: add tests for CQL forwarding
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
  query is also an EXECUTE request on a prepared statement
- verification metric updates

The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
916a9995c1 transport: enable CQL forwarding for strong consistency statements
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.

With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
21a7b036a5 transport: add remote statement preparation for CQL forwarding
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after gettting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
96a5e1c7ce transport: handle redirect responses in CQL forwarding
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding
2026-03-12 19:43:31 +01:00
Wojciech Mitros
8816d3038c transport: add exception handling for forwarded CQL requests
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.

The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
2026-03-12 19:41:37 +01:00
Wojciech Mitros
23bff5dfef transport: add basic CQL request forwarding
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.

In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
2026-03-12 19:41:35 +01:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Wojciech Mitros
170b82ddca idl: add a representation of client_state for forwarding
In the following patches, when we start allowing to forward CQL
requests to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
2026-03-12 17:48:58 +01:00
Wojciech Mitros
b4a7fefe20 cql_server: handle query, execute, batch in one case
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
dadb87047c transport: inline process_on_shard in cql_server::process
The process_on_shard method is relatively short, it's only used
in the process() method and the Process concept that is uses
is as long as the function itself. This area will be made more
complex by the following patches for cql forwarding, so we simplify
it by inlining process_on_shard in cql_server::process.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
24cdc3a10d transport: extract process() to cql_server
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.

The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.

This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler).
2026-03-12 17:48:57 +01:00
Wojciech Mitros
0e3469e89c transport: add messaging_service to cql_server
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
1376caf980 transport: add response reconstruction helpers for forwarding
Expose response::flags() and response::extract_body(), and a new constructor.
It will be needed for creating a cql_transport::response from the response body returned
during CQL forwarding.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
e44820ba1f transport: generalize the bounce result message for bouncing to other nodes
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
b4d66fda2e strong consistency: redirect requests to live replicas from the same rack
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.

In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.

If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
2026-03-12 17:48:54 +01:00
Andrzej Jackowski
3b9cd52a95 reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.

When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulatesa on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.

This commit adds a test that confirms that scenario by counting
internal_errors.
2026-03-12 17:09:34 +01:00
Alex
7fd39ba586 test/cluster: strengthen raft voters multi-DC test and tune debug runtime
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
  In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.

  This change restores coverage of the original intent:

  - Replace num_nodes parametrization with explicit DC triples.
  - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
  - Keep a larger topology case (6, 3, 3) for broader coverage.
  - Mark (6, 3, 3) as skip_mode(debug) with reason:
    larger topology case is too slow in debug on minipcs.

  Also updated comments/docstring to match the new setup.

Fixes: SCYLLADB-794

backport: None, it is done to deflake minipcs that will start working only on master

Closes scylladb/scylladb#29000
2026-03-12 17:07:45 +01:00
Wojciech Mitros
309abc44d9 transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
2026-03-12 16:03:14 +01:00
Marcin Maliszkiewicz
975cd60e05 ldap: fix use-after-move crash in ldap_reuser::reap()
After stop() moved _reaper, in-flight with_connection() callbacks could
still call reap(), which accessed the moved-from future causing a
SIGSEGV in future_base::detach_promise(). Add a seastar::gate so
stop() waits for all in-flight operations before moving _reaper.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043

Closes scylladb/scylladb#29015
2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak
c50cf32793 test: pylib: util: wait for CQL being ready with a shorter period
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.

Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.

The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
    servers = await manager.servers_add(3)
    await manager.get_ready_cql(servers)
```
2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak
f85628a9a0 group0: discovery: shorten the pause duration
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.

This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.

In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
2026-03-12 15:40:18 +01:00
Gleb Natapov
c67f876893 service level: make maybe_update_per_service_level_params synchronous
It does not call async functions any more.
2026-03-12 15:53:08 +02:00
Benny Halevy
b3fec20960 test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.

Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
61d5a2df02 test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
b8655748a2 test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
10dccc2c4e test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
c9d653fb1e test_tablets_migration: drop unused imports from cassandra.query
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Amnon Heiman
03d7ab17c9 storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
Replace CAS contention histograms in storage proxy stats with
estimated_histogram_with_max<128> and switch metrics/API aggregation to the
new histogram path.

Introduce a dedicated cas_contention_histogram alias and use it for
cas_read_contention and cas_write_contention.

Update API histogram reduction to merge the new histogram type via
estimated_histogram_with_max_merge.

Convert API JSON serialization to explicit offsets/counts using
get_buckets_offsets() and get_buckets_counts().

Export CAS contention metrics with to_metrics_histogram(...) instead of the
legacy get_histogram(1, 8) path for consistent bucket handling.
2026-03-12 14:10:35 +01:00
Amnon Heiman
cedd049218 estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
Add utility accessors to approx_exponential_histogram to export bucket
boundaries and bucket counts in a form suitable for display/tests when
Min < Precision causes repeated integer limits.

Add MAX compile-time constant alias for the template Max parameter.
Add get_buckets_offsets() to return bucket lower limits with duplicate
adjacent limits removed.

Add get_buckets_counts() to return counts aligned with the deduplicated
limits, merging counts from buckets that share the same lower limit.
Keep existing histogram behavior unchanged.
This new functionality is intended for API use and not for
performance-critical paths.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-12 14:04:40 +01:00
Gleb Natapov
c30907b8f2 service level: remove unused get_user_scheduling_group function 2026-03-12 14:28:26 +02:00
Gleb Natapov
a934d8391d service level: drop async find_effective_service_level
find_cached_effective_service_level does exactly same thing now and it
is synchronous.
2026-03-12 14:28:26 +02:00
Botond Dénes
15cfa5beeb mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010
2026-03-12 13:57:40 +02:00
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Nadav Har'El
27f0510280 test/alternator: test_gzip_request_oversized now passes on AWS
The Alternator test test_compressed_request.py::test_gzip_request_oversized
checks that a very large request that compresses to a small size is still
rejected. This test passed on Alternator, but used to fail on DynamoDB
because DynamoDB didn't reject this case. This was a bug in DynamoDB
(a "decompression bomb" vulnerability), and after I reported it, it
was fixed.

So now this test does pass on DynamoDB (after a small modification to
allow for different error codes). So remove its scylla_only marker,
and make the comment true to the current state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28820
2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz
b277d9d9aa cql3: track CQL parsing memory cost and use it for admission control
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
2026-03-12 10:16:10 +01:00
Botond Dénes
0b19a6de85 tombstone_gc: tombstone_gc_state::for_tests(): remove unused param
Closes scylladb/scylladb#28923
2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz
2d22eea2f9 Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().

The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.

I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.

throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Closes scylladb/scylladb#28847

* github.com:scylladb/scylladb:
  cql3: remove unnecessary assert()
  cql3: replace abort() by throwing_assert()
  cql3: Replace SCYLLA_ASSERT by throwing_assert
2026-03-12 09:09:24 +01:00
Szymon Malewski
3116db6c2d test: fix testJsonOrdering
The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing
JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30.
Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`.

Fixes #28467

Closes scylladb/scylladb#28924
2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz
5b2a07b408 utils: add rolling max tracker
We will use it later to track parser memory
usage via per query samples.

Tests runtime in dev: 1.6s
2026-03-12 08:56:41 +01:00
Marcin Maliszkiewicz
54ef8fca57 auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
DEFAULT_SUPERUSER_NAME is no longer referenced after removing the
role_part special-casing in describe_roles. DEFAULT_USER_PASSWORD was
dead code too.
2026-03-12 08:46:00 +01:00
Marcin Maliszkiewicz
029410e159 auth: use configurable default_superuser in describe_roles
Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with
default_superuser(_qp) which reads from the auth_superuser_name
config option. This makes the IF NOT EXISTS clause in DESCRIBE
output correct for clusters with a non-default superuser name.
2026-03-12 08:45:47 +01:00
Nadav Har'El
09a399ae3c Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman
_"A journey of a thousand miles begins with a single step" Lao Tzu_

ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.

Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.

This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.

Minor adjustments were made to the code and tests so that the tests would pass.

Follow-up PRs will apply the same pattern to the rest of the code.

**New feature no need to backport**

Closes scylladb/scylladb#28987

* github.com:scylladb/scylladb:
  alternator: migrate to operation_size_kb histograms
  test/alternator/test_metrics.py: Update the bucket in the histogram search
  alternator: Use batch_histogram for batch size histograms
  estimated_histogram.hh: adds estimated_histogram_with_max
2026-03-12 00:06:16 +02:00
Wojciech Mitros
b1bd206147 transport: extract the error handling from process_request_one
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
2026-03-11 19:40:47 +01:00
Wojciech Mitros
6184b1d5ea transport: move error response helpers from connection to cql_server
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.

After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
2026-03-11 19:40:47 +01:00
Amnon Heiman
1339a44163 alternator: migrate to operation_size_kb histograms
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.

Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.

Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.

Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
2026-03-11 17:29:14 +02:00
Marcin Maliszkiewicz
adc840919b auth: move default_superuser to common, remove _superuser member
Move default_superuser() to auth::meta in common.{hh,cc} and remove the
cached _superuser member from both standard_role_manager and
password_authenticator. The superuser name comes from config which is
immutable at runtime, so caching it is unnecessary.
2026-03-11 16:28:38 +01:00
Marcin Maliszkiewicz
993e06c1ae auth: use LOCAL_ONE for all auth queries
Removes auth-v1 hack for cassandra superuser as auth-v1
code no longer exists.

Also CL is not really used when quering raft replicated
tables (like auth ones), but LOCAL_ONE is the least confusing
one.
2026-03-11 16:27:15 +01:00
Marcin Maliszkiewicz
6d1153687a auth: remove get_auth_ks_name indirection
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
2026-03-11 16:26:47 +01:00
David
79f9967eaa docs: update theme 1.9
Motivation

Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11+–3.14, Node.js to 22, and replaces Poetry with uv for dependency management.

Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026
How to test

* Make sure you are using Python 3.11-3.14:
* python --version
* Install uv:
* make setupenv
* Build the docs:
* make preview
* Docs should render without errors at http://127.0.0.1:5500

Closes scylladb/scylladb#28971
2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk
2e68f48068 nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739
2026-03-11 16:35:04 +02:00
Dani Tweig
45d7d9a96c .github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event
What changed

* Added closed to milestone event types in
  call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
  (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)

Why (Requirements Summary)

* The call_sync_milestone_to_jira.yml workflow only triggered on
  milestone creation. When a GitHub milestone is closed, the
  corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
  projects) should be marked as released. Adding the closed trigger
  enables the called workflow (main_sync_milestone_to_jira_release.yml
  in github-automation) to handle both creating and releasing Jira
  versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
  when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
  and scylla-machine-image repos.

Fixes:PM-216

Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync

Closes scylladb/scylladb#28981
2026-03-11 15:56:55 +02:00
Amnon Heiman
69fbcd32bd test/alternator/test_metrics.py: Update the bucket in the histogram search 2026-03-11 15:24:05 +02:00
Amnon Heiman
50af1f3671 alternator: Use batch_histogram for batch size histograms
Switch batch-related histograms to estimated_histogram_with_max.
Results with better memory consumption and improve efficiency.
2026-03-11 15:21:25 +02:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Radosław Cybulski
fe8117feee alternator: fix shard's parent calculation for vnodes
Fix an invalid condition, when searching for a parent shard, when table
is based on vnodes. Shards have associated with them `last token` -
token, than marks the end of the range of tokens they consume (inclusive).
An additional assumptions are whole token space is used and
(for vnodes) token space wraps around.

Previously code looked like this:
    auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
                    return t < id.token();
    });
    if (pid != pids.begin()) {
        pid = std::prev(pid);
    }

An `upper_bound` call with `t < id.token()` means it is looking for
an iterator, for which value `t < id.token()` changed to true,
which effectively means a position, where iterator is bigger
then searched value. Then we move iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
      (not ok - should be 2)

The fix is to replace it with `std::lower_bound` and remove conditional
backward motion. Since we've a guarantees that whole token space is used
if `std::lower_bound` ends with `end()` value, then we have a wrap
around case and we need to pick `begin()` as result.

Fixes #28354
Fixes: SCYLLADB-537

Closes scylladb/scylladb#28382
2026-03-11 14:51:42 +02:00
Calle Wilund
bc544eb08e gcs_fixture: Change to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
eb2dfe04e1 aws_kms_fixture: Modify to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
4a8afd9649 test/lib/proc_util: Add docker helper
Adds boost test equivalent of dockerized_service to
handle launching dockerized mock service using ephermal port,
query port and return the process.
2026-03-11 12:32:02 +01:00
Calle Wilund
3e8a9a0beb pytest: use ephemeral port publish for docker mock servers
Changes dockerized_service to use ephermal port publish, and
query the published port from podman/docker.
Modifies client code to use slightly changed usage syntax.
2026-03-11 12:32:01 +01:00
Nadav Har'El
b411d436de config: move named_value<T> method bodies out-of-line
The previous commit added extern template declarations to suppress
named_value<T> instantiation in every translation units, but those only
suppress non-inline members. All method bodies defined inside the class
body were inline and thus exempt from extern template, so they were
still emitted as weak symbols in every TU that used them.

Fix this by moving all named_value<T> method definitions out of the class
body in config_file.hh and into config_file_impl.hh as out-of-line template
definitions.  Since config_file_impl.hh is included only by db/config.cc,
utils/config_file.cc, sstables/compressor.cc, and
ent/encryption/encryption_config.cc, the method bodies are now compiled
in only those four TUs.

Also add the two missing explicit instantiation pairs that caused linker
errors:
- named_value<vector<object_storage_endpoint_param>> in db/config.cc
- named_value<encryption_config::string_string_map> in encryption_config.cc
2026-03-11 13:20:03 +02:00
Piotr Dulikowski
d9a277453e Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
2026-03-11 12:09:23 +01:00
Calle Wilund
e3e940bc47 dbuild: Use container network in dbuild nested containers
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).

This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless if host container is docker or not.
2026-03-11 12:05:51 +01:00
Calle Wilund
8a56eafd39 scylla_cluster: Read notify sock in background to prevent deadlock
Starts a thread to process scylla notify messages (NOTIFY_SOCKET)
instead of just processing inline, non-blocking. This because it
is possible for the pipe created to be to small to hold enough
messages for us to reach the point where we otherwise even read
from said pipe, allowing other end (scylla) to proceed.
2026-03-11 11:59:00 +01:00
Patryk Jędrzejczak
37aeba9c8c Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.

Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling  send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.

Auth and service levels write paths are switched to use this new mechanism.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650

Backport: no, new feature

Closes scylladb/scylladb#28731

* https://github.com/scylladb/scylladb:
  test: add tests for global group0_batch barrier feature
  qos: switch service levels write paths to use global group0_batch barrier
  auth: switch write paths to use global group0_batch barrier
  raft: add function to broadcast read barrier request
  raft: add gossiper dependency to raft_group0_client
  raft: add read barrier RPC
2026-03-11 10:37:19 +01:00
Botond Dénes
54bddeb3b5 sstables/mx/writer: yield after writing a cell
With the goal of avoiding stalls on writing large collections, like
below:
    ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32:
    |              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85
    ++           - addr=0x541b6d4:
    |              seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811
    |              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838
    ++           - addr=0x541afb7:
    |              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479
    ++           - addr=0x541b86c:
    |              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214
    |              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234
    |              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548
    /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped

    ++           - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f:
    |              sigpending at ??:0
    ++           - addr=0x1760bf6:
    |              std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302
    |              (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421
    |              (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365
    |              (inlined by) _ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62
    |              (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103
    |              (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232
    |              (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101
    |              (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233
    |              (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97
    |              (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221
    ++           - addr=0x1677af3:
    |              sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261
    |              (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287
    |              (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377
    |              (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766
    |              (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425

Putting the yield in write_cell() instead of in write_collection() means
that writing any row benefits from the added yield point in the middle.

Refs: SCYLLADB-964

Closes scylladb/scylladb#28948
2026-03-11 10:34:55 +01:00
Nadav Har'El
e0c13518ae config: suppress named_value<T> instantiation in every source file
config.hh is included by a large fraction of the codebase. It pulls in
utils/config_file.hh, whose named_value<T> template has its method
bodies defined in config_file_impl.hh. Those bodies depend on three of
the heaviest Boost headers – boost/program_options.hpp,
boost/lexical_cast.hpp, and boost/regex.hpp – as well as yaml-cpp.
Because the method bodies are reachable from config.hh, every
translation unit that includes config.hh was silently instantiating all
of named_value<T>'s methods (for each distinct T) and compiling that
Boost/yaml-cpp machinery from scratch.

Fix this by adding extern template struct declarations for all 32
distinct named_value<T> specialisations used by db::config:
- the 14 primitive/stdlib types go into utils/config_file.hh
- the 18 db-specific types (enum_option<…>, seed_provider_type, etc.)
  go into db/config.hh

Matching explicit template struct instantiation definitions are added in
db/config.cc, which is already the only translation unit that includes
config_file_impl.hh.  As a result the Boost/yaml-cpp template machinery
is compiled exactly once (in config.o) instead of being re-instantiated
in every including TU.

One subtlety: named_value<seed_provider_type> has an explicit member
specialisation of add_command_line_option.  Per [temp.expl.spec], such
a specialisation must be declared before any extern template declaration
of the enclosing class template, so a forward declaration of the
specialisation is added to config.hh ahead of the extern template line.

Also, for some of the types we explicitly instantiated in db/config.cc,
the named_value<T> constructor calls config_type_for<T>(), which we
also need to provide explicit specializations - some of them we already
had but some were missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 11:30:39 +02:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Piotr Dulikowski
38a2829f69 Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs.

The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.

Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.

Fixes: VECTOR-189

Closes scylladb/scylladb#26139

* github.com:scylladb/scylladb:
  vector_search: Avoid quadratic reallocation in response_content_to_sstring
  vector_store_client: Return HTTP error description, not just code
2026-03-11 09:19:27 +01:00
Calle Wilund
6d8ac23731 test_encryption: Use maximum replication in _smoke_test
Refs: SCYLLADB-557

We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster

Closes scylladb/scylladb#28959
2026-03-11 09:54:57 +02:00
Nadav Har'El
00a819bcd8 cql3: remove unnecessary assert()
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:

        const auto shard_num = smp::count;
        assert(shard_num > 0)

Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:24 +02:00
Nadav Har'El
34eec020b3 cql3: replace abort() by throwing_assert()
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().

All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, in none of these places
we absolutely need to crash - and exception that aborts the current
operation should be enough.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:11 +02:00
Nadav Har'El
c87d6407ed cql3: Replace SCYLLA_ASSERT by throwing_assert
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().

The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.

I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.

throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:41:20 +02:00
Botond Dénes
99fa912f1b Merge 'Generalize streaming scopes tests' from Pavel Emelyanov
To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite.

This patch generalizes both into a do_test_streaming_scopes() non-test function

Closes scylladb/scylladb#28874

* github.com:scylladb/scylladb:
  test: Re-sort comments around do_test_streaming_scopes()
  test: Split do_load_sstables()
  test: Drop load_fn argument from do_load_sstables()
  test: Re-use do_test_streaming_scopes() in refresh test
  test: Introduce SSTablesOnLocalStorage
  test: Introduce SSTablesOnObjectStorage
  test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
2026-03-11 09:35:21 +02:00
Dmitriy Kruglov
cee44716db docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458
2026-03-11 09:31:35 +02:00
Nadav Har'El
401dc1894c test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every time a test XPASSes, i.e., an XFAIL test actually
passes, to be considered an error and fail the test.

While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".

So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.

P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28973
2026-03-11 09:29:30 +02:00
Robert Bindar
29619e48d7 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978
2026-03-11 09:27:04 +02:00
Botond Dénes
3fed6f9eff Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait
2026-03-11 09:24:07 +02:00
Raphael S. Carvalho
cc5b1acadf Improve log when sstable load fails due to missing tablet replica
A bug or some bad operator intervention can lead to a sstable existing in a
node after the tablet replica was moved to a different node.

This will result sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but one
investigating it might want to know which process (repair, memtable, whatever)
generated that sstable, if the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator to correlate with events in other nodes
(includes actions from the coordinator) to get closer to the root cause.

The new log will look like this:

"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"

Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28921
2026-03-11 06:20:34 +02:00
Avi Kivity
b17e1259e3 Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik
Vector deserialization is an operation which performance is critical for
vector similarity search feature because it is frequently executed during
rescoring operation. Some of the identified performance bottlenecks
for it include:

1. Per-element virtual dispatch in deserialize(): each of the N elements
   went through visit() which switches on ~28 type variants. For a
   1024-dimension float vector, that's 1024 redundant type switches when
   the element type is the same for all of them.

2. Redundant work in split_fragmented(): value_length_if_fixed() was
   called inside the loop (N virtual calls), and no reserve() was done
   on the output vector causing repeated reallocations.

This series fixes both:

- Introduce deserialize_vector_visitor that dispatches on the element
  type once for the entire vector, then loops inside the resolved
  handler. Simple numeric types (float, int, etc.) call
  deserialize_value() directly with no virtual dispatch per element.
  String types (ascii, utf8) get a dedicated handler that skips
  make_empty() (sstring has no empty_t constructor). Complex types
  (list, map, tuple, etc.) fall back to per-element dispatch.

- In split_fragmented(), reserve the output vector to _dimension and
  cache value_length_if_fixed() before the loop.

Benchmark results (1024-dim float vector, release build, -O3 -flto):

  deserialize:      15.73 us -> 11.70 us  (1.34x, 26% faster)
  split_fragmented: 10.34 us ->  7.45 us  (1.39x, 28% faster)

References: SCYLLADB-471

Backport: none, unless we observe some critical performance improvement for quantization.

Closes scylladb/scylladb#28618

* github.com:scylladb/scylladb:
  types: optimize reading vector fragments
  types: optimize vector deserialization for high-dimensional vectors
2026-03-11 00:39:46 +02:00
Dawid Mędrek
167feabe1a cql3: Reject user-provided timestamps for strongly consistent tables
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.

Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.

Fixes SCYLLADB-879

Closes scylladb/scylladb#28867
2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz
8ae80a32c0 Update seastar submodule
* seastar d2953d2a...4d268e0e (32):
  > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
    doc: update prometheus.md with __name__ filter enhancements
    prometheus: support prefixed names in __name__ filter
    prometheus: add benchmarks for name filter performance
    prometheus: support multiple __name__ query parameters
    prometheus: move write_body_args to header
  > fair_queue: Subtract from _queued_capacity on pop_front()
  > memory: expose cumulative allocated bytes statistic
  > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
    test: IO bandiwdth throttler unit tests
    code: Add ability to configure IO bandwidth limit for supergroup
    io_queue: Have more than one throttler par class
    io_queue: Introduce bandwidth_throttler helper class
    io_queue: Nest io_group::priotiy_class_data-s
    io_queue: Update class bandwidth on group's class data
    io_queue: Make io_group::priority_class_data::tokens() static
    fair_queue: Introduce group (un)plugging
  > Fix _shard_to_numa_node_mapping double population
  > Use exception parameter in log_timer_callback_exception()
  > Fix wakeup_granularity() fallback debug-fs reading
  > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
  > build: support tuning -ffile-prefix-map
  > test: Remove unused C::dup() method of testing class
  > src/core/reactor: introduce reactor::get_backend_name()
  > util/process: add pid() accessor
  > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
    coroutine.hh: disable source_location for GCC to avoid ICE
    reactor: improve do_dump_task_queue reporting
    Use source_location in `do_dump_task_queue`
    Update backtrace with source locations of resume points
    Add calls to update resume_point
    Add a std::source_location (resume_point) to task object.
  > Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
    file: Templatize posix_file_handle_impl
    file: Don't dup() non-read-only files
    file: Split ..._impl::dup() implementations
    test: Add a simple test for dup()
  > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
    reactor: Deprecate make_pollable_fd()
    net/posix: Create file_desc for sockets in-place
    reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
    net/posix: Re-format constructor initializer lists
  > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
    test: add fuzz tests to CI workflow
    test: add sstring differential fuzzer
    test: add fuzz testing infrastructure
  > Introduce "integrated queue length" metrics and use it for IO classes (#3210)
  > reactor: Remove get_sg_data(unsigned) overload
  > memcached: Stop using scattered_message
  > reactor: Mark uptime() method const
  > alien: Remove deprecated run_on and submit_to calls
  > file: make open_flags and access_flags constexpr
  > scheduling: Unfriend some methods from scheduling_group
  > reactor: Move _dying bit to epoll backend
  > file: coroutinize the with_file templates
  > configure: validate --cook ingredient names
  > fix trailing whitespace
  > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
    perf_tests: document overhead column and threshold options
    perf_tests: add measurement overhead tracking and warnings
    perf_tests: remove inline/hot attributes from time_measurement methods
    perf_tests: move time_measurement class to implementation file
    perf_tests: move perf counters into time_measurement singleton
  > rpc: log handler type
  > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
    Add GitHub Actions workflow for pre-commit enforcement
    Add pre-commit setup documentation to HACKING.md
    Add pre-commit configuration with trailing-whitespace hook
    Remove trailing whitespace from source files
  > posix-stack: Make internal::posix_connect() resolve exceptions into futures
  > sstring: fix npos to be size_t for consistency with std::string

Closes scylladb/scylladb#28954
2026-03-10 22:06:58 +02:00
Szymon Wasik
7fae78d2b0 types: optimize reading vector fragments
There was a redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.

This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.

Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us ->  7.45 us  (1.39x, 28% faster)
2026-03-10 20:17:31 +01:00
Taras Veretilnyk
579269b3c5 test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
Verify that scylla sstable upgrade fails when an sstable has a
corrupted Statistics component digest, and succeeds when the
--ignore-component-digest-mismatch flag is provided.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
fc4c82b962 docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
7214f5a0b6 sstables: propagate ignore_component_digest_mismatch config to all load sites
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.

Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
c123f637ea sstables: add option to ignore component digest mismatches
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.

Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
95420014ea sstable_compaction_test: Add scrub validate test for corrupted index
Generalize corrupt_sstable() and scrub_validate_corrupted_file() to
accept a component_type parameter, defaulting to Data, so they can be
reused for corrupting other components.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
a3912cf7f1 sstables: add tests for component digest validation on corrupted SSTables
Add tests that verify SSTable component digest validation detects
corruption on load. Each test writes an SSTable, corrupts a specific
component file by flipping a bit, then asserts that reloading the
SSTable throws malformed_sstable_exception with the expected digest
mismatch message.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
e78a3d2c44 sstables: validate index components digests during SSTable scrub in validate mode 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
9decbdeab0 sstables: verify component digests on SSTable load
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.

The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
478c1eaec5 sstables: add digest_file_random_access_reader for CRC32 digest computation
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.
2026-03-10 19:24:05 +01:00
Szymon Wasik
6c0ef8eb92 types: optimize vector deserialization for high-dimensional vectors
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.

This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us  (1.34x, 26% faster)
2026-03-10 18:21:34 +01:00
Andrzej Jackowski
9247dff8c2 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Refs: SCYLLADB-163

Closes scylladb/scylladb#28982
2026-03-10 18:57:31 +02:00
Szymon Wasik
74d86d3fe9 vector_search: Avoid quadratic reallocation in response_content_to_sstring
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
2026-03-10 17:45:55 +01:00
Szymon Wasik
d27610f138 vector_store_client: Return HTTP error description, not just code
This simple patch adds support for storing the HTTP error description
that Vector Store client receives from vector store. Until now it was
just printed to the log but it was not returned. For this reason it
was not forwarded to the drivers which forced users to access ScyllaDB
server logs to understand what is wrong with Vector Store.

This patch also updates formatter to print the message next to the
error code.

Fixes: VECTOR-189
2026-03-10 17:22:30 +01:00
Nadav Har'El
92ee959e9b test/alternator: speed up test_streams.py by using module-scope fixtures
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.

We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything.  With this guarantee in place,
all stream-table fixtures can safely use scope="module".

After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:14:04 +02:00
Nadav Har'El
6ac1f1333f test/alternator: test_streams.py don't use fixtures in 4 tests
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
other (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.

So this patch we modify these four tests to create their own new table
instead of using a fixture.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:12:33 +02:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.

Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
Aleksandra Martyniuk
e02b0d763c service: fix indentation 2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk
02257d1429 test: add test_tablet_repair_wait
Add a test that checks if tablet_virtual_task::wait won't fail if
tablets are merged.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
0e0070f118 service: remove status_helper::tablets
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.

Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
e5928497ce service: tasks: scan all tablets in tablet_virtual_task::wait
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.
2026-03-10 14:42:21 +01:00
Andrei Chekun
c36df5ecf4 test.py: eliminite drivers exception
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#28957
2026-03-10 14:31:36 +02:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
  processing.

  - update result_message::prepared::cql constructor to accept pinned entry
  - construct weak view from owned pinned entry inside the message
  - pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Alex
27051d9a7c cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
  handle could no longer be promoted and the prepare path could fail nondeterministically.

  This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
2026-03-10 14:17:57 +02:00
Piotr Dulikowski
37f8cdf485 Merge 'test.py: fix unawaited ScyllaLogFile.grep() coroutines' from Andrei Chekun
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.

Fixes: SCYLLADB-903

No backport, framework fix and one test fix.

Closes scylladb/scylladb#28909

* github.com:scylladb/scylladb:
  test.py: fix unawaited ScyllaLogFile.grep() coroutines
  tests: fix test_group0_recovers_after_partial_command_application
2026-03-10 12:29:23 +01:00
Dario Mirovic
f72081194c db: use prefix tombstones in DROP TABLE schema mutations
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.

The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.

Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.

In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.

Refs SCYLLADB-815

Closes scylladb/scylladb#28976
2026-03-10 11:59:00 +01:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
b633ec1779 features: move GROUP0_SCHEMA_VERSIONING to deprecated features list 2026-03-10 10:46:48 +02:00
Gleb Natapov
40ec0d4942 migration_manager: remove unused forward definitions 2026-03-10 10:46:48 +02:00
Gleb Natapov
aa9eb0ef8c test: remove unused code 2026-03-10 10:46:48 +02:00
Gleb Natapov
4660f908f9 auth: drop auth_migration_listener since it does nothing now 2026-03-10 10:46:48 +02:00
Gleb Natapov
74b5a8d43d schema: drop schema_registry_entry::maybe_sync() function
Schema is synced through group0 now. Drop all the test of the function
as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
b9f3281af6 schema: drop make_table_deleting_mutations since it should not be needed with raft
Also remove the test since it is no longer relevant
2026-03-10 10:46:47 +02:00
Gleb Natapov
f76199e5c2 schema: remove calculate_schema_digest function
It is used by the test only, so remove the test and its data as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
08e33ad7f7 schema: drop recalculate_schema_version function and its uses
There is no need to recalculate schema version any more since it is set
by group0.
2026-03-10 10:46:39 +02:00
Gleb Natapov
7bb334a5dd migration_manager: drop check for group0_schema_versioning feature
We do not allow upgrading from a version that does not have it any
longer.
2026-03-10 10:39:59 +02:00
Gleb Natapov
4402b030ae cdc: drop usage of cdc_local table and v1 generation definition 2026-03-10 10:39:59 +02:00
Gleb Natapov
6769615ff1 storage_service: no need to add yourself to the topology during reboot since raft state loading already did it 2026-03-10 10:39:59 +02:00
Gleb Natapov
33fbda9f3b storage_service: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
0e3e7be335 group0: drop with_raft() function from group0_guard since it always returns true now
Also drop the code that assumed that the function can return false.
2026-03-10 10:39:58 +02:00
Gleb Natapov
4e56ca3c76 gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
They were used by legacy topology and cdc code only.
2026-03-10 10:39:58 +02:00
Gleb Natapov
77f8f952b2 gossiper: drop tokens from loaded_endpoint_state 2026-03-10 10:39:58 +02:00
Gleb Natapov
706754dc24 gossiper: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
8ee4cdd4b7 storage_service: do not pass loaded_peer_features to join_topology()
They are not used there any longer.
2026-03-10 10:39:58 +02:00
Gleb Natapov
24c01f2289 storage_service: remove unused fields from replacement_info 2026-03-10 10:39:58 +02:00
Gleb Natapov
2d8722d204 gossiper: drop is_safe_for_restart() function and its use
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6f739a8ee4 storage_service: remove unused variables from join_topology 2026-03-10 10:39:58 +02:00
Gleb Natapov
d35b83bec8 gossiper: remove the code that was only used in gossiper topology
The topology state machine is always present now and can be passed to
the gossiper during creation.
2026-03-10 10:39:58 +02:00
Gleb Natapov
390eb46c1a storage_service: drop the check for raft mode from recovery code
In non raft mode the node will node boot at all, so the check is
redundant now.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6a7e850161 cdc: remove legacy code
The patch removes test/boost/cdc_generation_test.cc since it unit tests
cdc::limit_number_of_streams_if_needed function which is remove here.
2026-03-10 10:38:57 +02:00
Gleb Natapov
0b508c5f96 test: remove unused injection points
Also remove test_auth_raft_command_split test which is irrelevant since 5ba7d1b116
because it does not use the function that injects max sized command after the
commit.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1d188f0394 auth: remove legacy auth mode and upgrade code
A system needs to be upgraded to use v2 auth before moving to this
ScyllaDB version otherwise the boot will fail.
2026-03-10 10:09:39 +02:00
Gleb Natapov
02fc4ad0a9 treewide: remove schema pull code since we never pull schema any more
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
2026-03-10 10:09:39 +02:00
Gleb Natapov
0cf726c81f raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer 2026-03-10 10:09:39 +02:00
Gleb Natapov
60a861c518 group0: hoist the checks for an illegal upgrade into main.cc
The checks are spread around now, but having then in one place and done
as early as possible simplifies the logic.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1ff98c89e3 api: drop get_topology_upgrade_state and always report upgrade status as done
Non upgraded version will not boot any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
be153a4eb7 service_level_controller: drop service level upgrade code
We do not allow upgrade from a version that is not updated yet, so the
code is not used any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
61cc091364 test: drop run_with_raft_recovery parameter to cql_test_env
It is unused.
2026-03-10 10:09:38 +02:00
Gleb Natapov
00083b42a7 group0: get rid of group0_upgrade_state
Simplify code by getting rid of group0_upgrade_state since upgrade is no
longer supported, so no need to track its state. The none upgraded node
will simply not boot and to detect that the patch checks the state
directly from the system table.
2026-03-10 10:09:38 +02:00
Gleb Natapov
d4b55de214 storage_service: drop topology_change_kind as it is no longer needed
The mode is always raft, so no need to keep a variable that tracks that.
2026-03-10 10:09:38 +02:00
Gleb Natapov
68ea6aa0a6 storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more 2026-03-10 10:09:38 +02:00
Gleb Natapov
06652948f3 service_storage: remove unused functions
raft_topology_change_enabled and upgrade_state_to_topology_op_kind are
not use any more. Remove the code.
2026-03-10 10:09:38 +02:00
Gleb Natapov
e8c72b7ba0 storage_service: remove non raft rebuild code
Only raft is supported now.
2026-03-10 10:09:38 +02:00
Gleb Natapov
49ebab971d storage_service: set topology change kind only once
The only support mode is topology_change_kind::raft, so always set it in
storage_service::join_cluster during join or regular boot. Drop the check
for legacy mode from raft_group0::setup_group0_if_exist since the mode
will not be set at this point any longer. The wrong upgrade will still
be detected in storage_service::join_cluster where topology.upgrade_state
is checked directly.
2026-03-10 10:09:38 +02:00
Gleb Natapov
4e072977d4 group0: drop in_recovery function and its uses
Legacy recovery procedure is no longer supported and the code can be
dropped.
2026-03-10 10:09:38 +02:00
Gleb Natapov
770762edd8 group0: rename use_raft to maintenance_mode and make it sync
group0_upgrade_state::recovery is now used only in maintenance mode
so rename the function to indicate it. Also there is no preemption point
in the function any more and it can be a regular function, not a
co-routine.
2026-03-10 10:09:33 +02:00
Pavel Emelyanov
61af7c8300 test: Re-sort comments around do_test_streaming_scopes()
The test description of refreshing test is very elaborated and it's
worth having it as the description of the streaming scopes test itself.
Callers of the helper can go with smaller descriptions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
5ce3597c25 test: Split do_load_sstables()
This helper does two things -- sorts sstables per server according to
scope in use and calls sstables_storage.restore(). The code looks better
if the sorting of sstables stays in a helper and the call for .restore()
is moved to the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
8c1fb2b39a test: Drop load_fn argument from do_load_sstables()
Now all callers provide the sstables_storage argument and the load_fn is
effectively unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:08 +03:00
Pavel Emelyanov
59051ccc28 test: Re-use do_test_streaming_scopes() in refresh test
Now it's possible to replace the whole body of the
test_refresh_with_streaming_scopes() test by calling the corresponding
helper function from backup/restore test module. This helper does
exactly the same, and the SSTablesOnLocalStorage class provides the
necessary save/restore implementations.

One more thing to mention -- the refreshing test for some reason only
wants to run with restored min-tablet-count equal to the original one.
The do_test_streaming_scopes() needs to account for that, as it runs the
tests for more options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:07 +03:00
Pavel Emelyanov
f6f1cb0391 test: Introduce SSTablesOnLocalStorage
This class implements some of the sstables manipulations performed by
test_refresh_with_streaming_scopes(). It's here to facilitate next patch
that will use it to call do_test_streaming_scopes() helper.

This patch moves two blocks of code out of the test into this new class.

The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't --
the tmpbackup variable holds a name of a _subdir_ inside servers'
workdirs. This path doesn't really exist on disk on its own, so removing
it is a no-op.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:40 +03:00
Pavel Emelyanov
dae4da1810 test: Introduce SSTablesOnObjectStorage
The class in question performs two operations for
do_test_streaming_scopes(): saves sstables and restores them. Current
caller of the helper is the test_restore_with_streaming_scopes() test
that need to backup sstables on object storage and restore them from
there with the restoration API. The SSTablesOnObjectStorage class does
exactly that.

The change in do_load_sstables() that checks for sstables_storage to be
non None is needed to keep test_refresh_with_streaming_scopes() work --
that test doesn't provide sstables_storage (yet) and the function in
question will call the load_fn callback. Next patch will eliminate it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:39 +03:00
Pavel Emelyanov
5a033dea47 test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
The body of this test is duplicated by
test_refresh_with_streaming_scopes() test from other module. Keeping it
in a non-test top-level function will help generalizing these two tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:57:53 +03:00
Nadav Har'El
d78ea3d498 test/cqlpy: mark test_unbuilt_index_not_used not strictly xfail
A few days ago, in commit 7b30a3981b we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an
xfail test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

    test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used

This test reproduces a timing-related bug (if you read from a secondary
index "too quickly" you can get wrong results), but only about 90% of the
time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this
specific test. This undoes the xfail_strict for this specific test,
accepting that this specific test can either pass or fail. Note that
this does NOT make this test worthless - we still see this test failing
most of the time, and when a developer finally fixes this issue, the
test will begin to pass all the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956
(we'll probably need to follow up this fix with the same fix for other
xfail tests that can sometime pass).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28942
2026-03-09 22:48:20 +02:00
Avi Kivity
01ddc17ab9 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset
2026-03-09 22:46:01 +02:00
Botond Dénes
13ff9c4394 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855
2026-03-09 22:04:54 +02:00
Andrei Chekun
8acba40c84 test.py: fix unawaited ScyllaLogFile.grep() coroutines
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.

Fixes: SCYLLADB-903
2026-03-09 19:41:07 +01:00
Andrei Chekun
224a11be65 tests: fix test_group0_recovers_after_partial_command_application
Due to the fact that grep logs was not awaited this issue was masked.
With adding await for log grep it started to fail. This PR fixes the
test.
2026-03-09 19:41:07 +01:00
Nadav Har'El
16e7a88a02 test/alternator: fix do_test() in test_streams.py
Many tests in test/alternator/test_streams.py use a do_test() function
which performs a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.

Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug,
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.

We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,

test/alternator/run --count 2 --aws \
    test_streams.py::test_streams_putitem_keys_only

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-09 19:21:53 +02:00
Łukasz Paszkowski
147b355326 replica/table: avoid computing token range side in storage_group_of() on hot path
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.

Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
419e9aa323 replica/compaction_group: add lazy select_compaction_group() overloads
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.

Add an overload for selecting the compaction group for an sstable
spanning a token range.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
3f70611504 locator/tablets: add tablet_map::get_tablet_range_side()
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.

Pure addition, no behavior change.
2026-03-09 17:59:36 +01:00
Jakub Smolar
7cdd979158 db/config: announce ms format as highest supported
Uncomment the feature flag check in get_highest_supported_format()
to return MS format when supported, otherwise fall back to ME.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
949fc85217 db/config: enable ms sstable format by default
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
6b413e3959 cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.

Use chosen_sstable_format instead.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
b89840c4b9 api/system: add /system/chosen_sstable_version
Returns the sstable version currently chosen for use in for new sstables.

We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).

(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
2026-03-09 17:12:09 +01:00
Michał Chojnowski
9280a039ee test/cluster/dtest: reduce num_tokens to 16
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.

It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.

To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).

This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
2026-03-09 17:12:09 +01:00
Marcin Maliszkiewicz
96a2b0e634 test: add tests for global group0_batch barrier feature
Runtime: 16s in dev mode
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
6723ced684 qos: switch service levels write paths to use global group0_batch barrier
This ensures that we return auth functions only after
we wait until all live nodes apply our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
fe79fdf090 auth: switch write paths to use global group0_batch barrier
This ensures that we return auth functions only after
we wait until all live nodes apply our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
4c8681a927 raft: add function to broadcast read barrier request
This function ensures that all alive nodes executed
read barrier.

It will be usefull for the following commits which would
eventually delay returning response to the user until
mutations are applied on other nodes so that the user
may perceive better data consistency accross nodes.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
cbae84a926 raft: add gossiper dependency to raft_group0_client
In following commit raft_group0_client will send read
barrier RPC to all alive nodes, it takes list of the nodes
from gossiper.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
8422fbca9f raft: add read barrier RPC
The RPC does read barrier on a destination node.

It will be issued in following commits
to live nodes to assure that command was applied
everywhere.
2026-03-09 15:15:59 +01:00
Michał Chojnowski
ff60a5f1e5 cql3: suggest ALTER MATERIALIZED VIEW to users trying to use ALTER TABLE on a view
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.

The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.

But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".

This patch enhances the error message (and others similar to it)
to prevent the confusion.

Closes scylladb/scylladb#28831
2026-03-09 15:07:21 +01:00
Botond Dénes
1e41db5948 Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494

Needs backport to all live releases

Closes scylladb/scylladb#28933

* github.com:scylladb/scylladb:
  test: add test_tablet_repair_wait_with_table_drop
  service: tasks: return successful status if a table was dropped
2026-03-09 16:04:44 +02:00
Piotr Dulikowski
23ed0d4df8 Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.

Fixes: VECTOR-528

Backports to 2025.04 and 2026.01 are required, as these branches are also affected.

Closes scylladb/scylladb#28637

* github.com:scylladb/scylladb:
  vector_search: fix TLS server name with IP
  vector_search: add warn log for failed ann requests
2026-03-09 15:03:22 +01:00
Asias He
e0483f6001 test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness.  This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.

Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.

Fixes SCYLLADB-865

Closes scylladb/scylladb#28945
2026-03-09 15:27:45 +02:00
Marcin Maliszkiewicz
b6a7484520 docs: note eventual visibility of auth changes
Mention that role and permission changes are durable but may
not be immediately visible on other nodes due to asynchronous
replication.

Fixes: SCYLLADB-651

Closes scylladb/scylladb#28900
2026-03-09 14:07:10 +01:00
Piotr Dulikowski
42d70baad3 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928
2026-03-09 15:04:26 +02:00
Łukasz Paszkowski
826fd5d6c3 test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626
2026-03-09 14:45:09 +02:00
Calle Wilund
ef795eda5b test_encryption: Fix test_system_auth_encryption
Fixes: SCYLLADB-915

Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).

Closes scylladb/scylladb#28887
2026-03-09 14:38:31 +02:00
Marcin Maliszkiewicz
f177259316 Merge 'vector_search: small improvements' from Karol Nowacki
vector_search: small improvements

This PR addresses several minor code quality issues and style inconsistencies within the vector_search module.

No backport is needed as these improvements are not visible to the end user.

Closes scylladb/scylladb#28718

* github.com:scylladb/scylladb:
  vector_search: fix names of private members
  vector_search: remove unused global variable
2026-03-09 11:42:35 +01:00
Botond Dénes
6bba4f7ca1 Merge 'test: cluster: util: sleep for 0.01s between writes in do_writes' from Patryk Jędrzejczak
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.

This PR also contains some minor updates to `test_raft_recovery_user_data`.

Fixes SCYLLADB-929

No backport:
- the failures were observed only in master CI,
- no proof that the change fixes the issue, so backports could be a waste
  of time.

Closes scylladb/scylladb#28917

* github.com:scylladb/scylladb:
  test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely
  test: test_raft_recovery_user_data: use the exclude_node API
  test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
  test: cluster: util: sleep for 0.01s between writes in do_writes
2026-03-09 12:12:04 +02:00
Nadav Har'El
47e8206482 test/alternator: test, and document, Alternator's data encoding
This patch adds a test file, test/alternator/test_encoding.py, testing
how Alternator stores its data in the underlying CQL database. We test
how tables are named, and how attributes of different types are encoded
into CQL.

The test, which begins with a long comment, also doubles as developer-
oriented *documention* on how Alternator encodes its data in the CQL
database. This documentation is not intended for end-users - we do
not want to officially support reading or writing Alternator tables
through CQL. But once in a while, this information can come in handy
for developers.

More importantly, this test will also serve as a regression test,
verifying that Alternator's encoding doesn't change unintentionally.
If we make an unintentional change to the way that Alternator stores
its data, this can break upgrades: The new code might not be able to
read or write the old table with its old encoding. So it's important
to make sure we never make such unintentional changes to the encoding
of Alternator's data. If we ever do make *intentional* changes to
Alternator's data encoding, we will need to fix the relevant test;
But also not forget to make sure that the new code is able to read
the old encoding as well.

The new tests use both "dynamodb" (Alternator) and "cql" fixtures,
to test how CQL sees the Alternator tables. So naturally are these
tests are marked "scylla_only" and skipped on DynamoDB.

Fixes #19770.

Closes scylladb/scylladb#28866
2026-03-09 10:50:09 +01:00
Andrzej Jackowski
6fb5ab78eb db/config: move guardrails config to one place and reorder
The motivations for this patch are as follows:
 - Guardrails should follow similar conventions, e.g. for config names,
   metrics names, testing. Keeping guardrails together makes it easier
   to find and compare existing guardrails when new guardrails are
   implemented.
 - The configuration is used to auto-generate the documentation
   (particularly, the `configuration-parameters` page). Currently,
   the order of parameters in the documentation is inconsistent (e.g.
   `minimum_replication_factor_fail_threshold` before
   `minimum_replication_factor_warn_threshold` but
   `maximum_replication_factor_fail_threshold` after
   `maximum_replication_factor_warn_threshold`), which can be confusing
   to customers.

Fixes: SCYLLADB-256

Closes scylladb/scylladb#28932
2026-03-09 10:50:00 +01:00
Patryk Jędrzejczak
46b7170347 Merge 'test/pylib: centralize timeout scaling and propagate build_mode in LWT helpers' from Alex Dathskovsky
This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests. (starting with LWT test that got flaky)

1. Centralize timeout scaling
Introduce scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode.
Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic:
Ensures consistent behavior across the test suite
Simplifies maintenance and reasoning about timeout behavior
Reduces duplication and per-test scaling logic
This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time.

2. Make scale_timeout explicit in LWT helpers
Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Additionally:
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes
Update all LWT test call sites to pass build_mode explicitly
Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds

Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable.

backport: not required just an enhansment for test.py infra

Closes scylladb/scylladb#28840

* https://github.com/scylladb/scylladb:
  test/auth_cluster: align service-level timeout expectations with scaled config
  test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
  test/pylib: introduce scale_timeout fixture helper
2026-03-09 10:28:19 +01:00
Patryk Jędrzejczak
4c8dba15f1 Merge 'strong_consistency/state_machine: ensure and upgrade mutations schema' from Michał Jadwiszczak
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry

The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.

Fixes SCYLLADB-428

Strong consistency is in experimental phase, no need to backport.

Closes scylladb/scylladb#28546

* https://github.com/scylladb/scylladb:
  test/cluster/test_strong_consistency: add reproducer for old schema during apply
  test/cluster/test_strong_consistency: add reproducer for missing schema during apply
  test/cluster/test_strong_consistency: extract common function
  raft_group_registry: allow to drop append entries requests for specific raft group
  strong_consistency/state_machine: find and hold schemas of applying mutations
  strong_consistency/state_machine: pull necessary dependencies
  db/schema_tables: add `get_column_mapping_if_exists()`
2026-03-09 09:49:22 +01:00
Marcin Maliszkiewicz
4150c62f29 Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819

The flaky test is present on master and in previous release, so backporting only there.

Closes scylladb/scylladb#28849

* github.com:scylladb/scylladb:
  test_proxy_protocol: introduce extra logging to aid debugging
  test_proxy_protocol: fix flaky system.clients visibility checks
2026-03-09 08:37:57 +01:00
Yaron Kaikov
977bdd6260 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935
2026-03-08 21:34:51 +02:00
Wojciech Mitros
0008976e2f mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808
2026-03-08 16:23:22 +01:00
Wojciech Mitros
7d1e0a2e4d mv: allow skipping view updates if an empty collection remains unset
Currently, when we generate view updates, we skip the view update if
all columns selected by the view are unchanged in the base table update.
However, this does not apply for collection columns - if the base table
has a collection regular column, we never allow skipping generating
view updates and the reason for that is missing implementation.
We can easily relax this for the case where the collection was missing
before and after the update - in this commit we move the check for
collections after the check for missing cells.
2026-03-08 16:22:27 +01:00
Artsiom Mishuta
fda68811e8 test.py: fix strict-config argument.
The ini-level strict_config was removed/never existed as a config key in pytest 8 — it's only a command-line flag(and  back in pytest 9)
In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option

Fixes SCYLLADB-955

Closes scylladb/scylladb#28939
2026-03-08 16:09:29 +02:00
Taras Veretilnyk
739dd59ebc docs: document components_digests subcomponent and trailing digest in Scylla.db
Document the new `components_digests` subcomponent (tag 12) added to the
Scylla.db metadata component, which stores CRC32 digests of all checksummed
SSTable component files. Also document the trailing CRC32 digest that
stores digest of the scylla metadata itself.
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
2b1c37396a sstable_compaction_test: Add tests for perform_component_rewrite
Add two test cases to verify the correctness of the perform_component_rewrite
functionality:
- test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics
  component of a single sstable
- test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10
  sstables
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
591d13e942 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
54af4a26ca sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.

This also extends new rewrite component mechanism,
to rewrite metadata with updated digest together with the component.
2026-03-06 21:58:10 +01:00
Dawid Mędrek
5feed00caa Merge 'raft: read_barrier: update local commit_idx to read_idx when it's safe' from Patryk Jędrzejczak
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

I tested the performance impact of this change with the following test:
```python
for _ in range(10):
    await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.

No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.

Closes scylladb/scylladb#28891

* github.com:scylladb/scylladb:
  raft: server: fix the repeating typo
  raft: clarify the comment about read_barrier_reply
  raft: read_barrier: update local commit_idx to read_idx when it's safe
  raft: log: clarify the specification of term_for
2026-03-06 18:50:08 +01:00
Aleksandra Martyniuk
40dca578c5 test: add test_tablet_repair_wait_with_table_drop 2026-03-06 15:08:29 +01:00
Piotr Smaron
f12e4ea42b test_proxy_protocol: introduce extra logging to aid debugging
In case of an error, we want to see the contents of the system.clients
table to have a better understanding of what happened - whether the
row(s) are really missing or maybe they are there, but 1 digit doesn't
match or the row is half-written.
We'll therefore query for the whole table on the CQL side, and then
filter out the rows we want to later proceed with on the python side.
This way we can dump the contents of the whole system.clients table if
something goes south.
2026-03-06 14:50:12 +01:00
Piotr Smaron
d8cf2c5f23 test_proxy_protocol: fix flaky system.clients visibility checks
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819
2026-03-06 14:49:59 +01:00
Aleksandra Martyniuk
dd634c329f service: tasks: return successful status if a table was dropped
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.
2026-03-06 14:37:44 +01:00
Botond Dénes
4fdc0a5316 Merge 'Relax test's check_mutation_replicas() argument list' from Pavel Emelyanov
The one accepts long list of arguments, some of those is not really needed. Also some callers can be relaxed not to provide default values for arguments with such.

Improving tests, not backporting

Closes scylladb/scylladb#28861

* github.com:scylladb/scylladb:
  test: Remove passing default "expected_replicas" to check_mutation_replicas()
  test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
2026-03-06 11:25:00 +02:00
Szymon Malewski
d817e56e87 vector_similarity_fcts.cc: fix strict aliasing violation in extract_float_vector
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.

Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.

Follow-up #28754

Closes scylladb/scylladb#28912
2026-03-06 09:15:45 +01:00
Artsiom Mishuta
5d7a73cc5b test.py add support if non_gating tests
Add support for non_gating, the opposite of gating in dtest terminology, tests in test.py
codebase

This test will/should not be run by any current gating job (ci/next/nightly)

Closes scylladb/scylladb#28902
2026-03-06 09:39:32 +02:00
Andrei Chekun
01498a00d5 test.py: make HostRegistry singleton
HostRegistry initialized in several places in the framework, this can
lead to the overlapping IP, even though the possibility is low it's not
zero. This PR makes host registry initialized once for the master
thread and pytest. To avoid communication between with workers, each
worker will get its own subnet that it can use solely for its own goals.
This simplifies the solution while providing the way to avoid overlapping IP's.

Closes scylladb/scylladb#28520
2026-03-06 09:25:29 +02:00
Artsiom Mishuta
2be4d8074d test.py disable XFail tests on CI run
This PR disables running FXAIL tests on ci run to speed it up.

tests will continue run on "nightly" job and FAIL on unexpected pass
and will continue run on "NEXT" job and NOT FAIL on unexpected pass

Closes scylladb/scylladb#28886
2026-03-06 09:12:06 +02:00
Szymon Malewski
f9d213547f cql3: selection: fix add_column_for_post_processing for ORDER BY
The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query,
but are not part of SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ...  ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of selection with processing (e.g functions involved), result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` refereed to initial shape.
If selection refereed to the same column twice (e.g. `v, TTL(v)` as in #9435) final row was longer than initial and ordering refereed to incorrect column.
If a function in selection refereed to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses) the final row was shorter
and ordering tried to use a non-existing column.

This patch fixes the problem by making sure that column index of the final result set is used for ordering.

The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467.

Fixes #9435
Fixes #8100
Fixes #22061

Closes scylladb/scylladb#28472
2026-03-05 19:22:34 +02:00
Patryk Jędrzejczak
c8c57850d9 test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely 2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
c3aa4ed23c test: test_raft_recovery_user_data: use the exclude_node API
The API is now available.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
dd75687251 test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
The issue has been fixed.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
52940c4f31 test: cluster: util: sleep for 0.01s between writes in do_writes
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.

Fixes SCYLLADB-929
2026-03-05 17:13:40 +01:00
Calle Wilund
ab3d3d8638 build: add slirp4netns to dependencies
Needed for port forwarded podman-in-podman containers

[avi:
  - move from Dockerfile to install-dependencies.sh so non-container
  builds also get it
  - regenerate frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28870
2026-03-05 17:44:17 +02:00
Michał Jadwiszczak
37bbbd3a27 test/cluster/test_strong_consistency: add reproducer for old schema during apply 2026-03-05 13:50:20 +01:00
Michał Jadwiszczak
6aef4d3541 test/cluster/test_strong_consistency: add reproducer for missing schema during apply 2026-03-05 13:50:16 +01:00
Michał Jadwiszczak
4795f5840f test/cluster/test_strong_consistency: extract common function 2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
3548b7ad38 raft_group_registry: allow to drop append entries requests for specific raft group
Similar to `raft_drop_incoming_append_entries`, the new error injection
`raft_drop_incoming_append_entries_for_specified_group` skips handler
for `raft_append_entries` RPC but it allows to specify id of raft group
for which the requests should be dropped.

The id of a raft group should be passed in error injection parameters
under `value` key.
2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
b0cffb2e81 strong_consistency/state_machine: find and hold schemas of applying mutations
It might happen that a strong consistency command will arrive to a node:
- before it knows about the schema
- after the schema was changes and the old version was removed from the
  memory

To fix the first case, it's enough to perform a read barrier on group0.
In case of the second one, we can use column mapping the upgrade the
mutation to newer schema.

Also, we should hold pointers to schemas until we finish `_db.apply()`,
so the schema is valid for the whole time.
And potentially we should hold multiple pointers because commands passed
to `state_machine::apply()` may contain mutations to different schema
versions.

This commit relies on a fact that the tablet raft group and its state
machine is created only after the table is created locally on the node.

Fixes SCYLLADB-428
2026-03-05 13:47:40 +01:00
Tomasz Grabiec
b90fe19a42 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param
2026-03-05 13:28:13 +01:00
Botond Dénes
509f2af8db Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop
2026-03-05 14:18:25 +02:00
Patryk Jędrzejczak
f1978d8a22 raft: server: fix the repeating typo 2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
5a43695f6a raft: clarify the comment about read_barrier_reply
The comment could be misleading. It could suggest that the returned index is
already safe to read. That's not necessarily true. The entry with the
returned index could, for example, be dropped by the leader if the leader's
entry with this index had a different term.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1ae2ae50a6 raft: read_barrier: update local commit_idx to read_idx when it's safe
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1cbd0da519 raft: log: clarify the specification of term_for
When `idx > last_idx()`, the function does an out-of-bounds access to `_log`.
This may look contradictory to the current specification.
2026-03-05 13:06:07 +01:00
Michał Jadwiszczak
33a16940be strong_consistency/state_machine: pull necessary dependencies
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.
2026-03-05 12:33:17 +01:00
Alex
b32ef8ecd5 test/auth_cluster: align service-level timeout expectations with scaled config
Use scale_timeout_by_mode() in make_scylla_conf() to derive
  request_timeout_in_ms in test/pylib/scylla_cluster.py.

  Update test_connections_parameters_auto_update in
  test/cluster/auth_cluster/test_raft_service_levels.py to expect the
  mode-specific timeout string returned by the REST endpoint after this
  scaling change.
2026-03-05 13:32:15 +02:00
Alex
a66565cc42 test/lwt: propagate scale_timeout through LWT helpers; scale resize waits
Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes.
Adjust LWT test call sites to pass scale_timeout explicitly.
Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
2026-03-05 13:07:09 +02:00
Alex
73f1a65203 test/pylib: introduce scale_timeout fixture helper
Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode, the function will return a callable that will handle the timeout by mode.
This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments.
Centralizing the logic improves maintainability and makes timeout behavior easier to reason about.
This becomes increasingly important as we run tests on heterogeneous hardware configurations.
Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments.

No functional change beyond unifying existing timeout scaling behavior.
2026-03-05 13:07:09 +02:00
Anna Stuchlik
855c503c63 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152
2026-03-05 12:57:06 +02:00
Michał Jadwiszczak
d25be9e389 db/schema_tables: add get_column_mapping_if_exists()
In scenarios where we want to firsty check if a column mapping exists
and if we don't want do flow control with exception, it is very wasteful
to do
```
if (column_mapping_exists()) {
  get_column_mapping();
}
```
especially in a hot path like `state_machine::apply()` becase this will
execute 2 internal queries.

This commit introduces `get_column_mapping_if_exists()` function,
which simply wrapps result of `get_column_mapping()` in optional and
doesn't throw an exception if the mapping doesn't exist.
2026-03-05 11:55:57 +01:00
Artsiom Mishuta
7b30a3981b test.py: enable strict_config,xfail_strict,strict-markers
this commit enables 3 strict pytest options:

strict_config - if any warnings encountered while parsing the pytest section of the configuration file will raise errors.
xfail_strict - if markers not registered in the markers section of the configuration file will raise errors.
strict-markers - if tests marked with @pytest.mark.xfail that actually succeed will by default fail the test suite

and fix errors that occur after enabling these options

Closes scylladb/scylladb#28859
2026-03-05 12:54:26 +02:00
Dawid Mędrek
7564a56dc8 Merge 'tombstone_gc: allow using repair-mode tombstone gc with RF=1 tables' from Botond Dénes
Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones.
This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour.

After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes.

Fixes: SCYLLADB-106.

New feature, no backport required.

Closes scylladb/scylladb#22945

* github.com:scylladb/scylladb:
  test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1
  tombstone_gc: allow use of repair-mode for RF=1 tables
  replica/table: update rf=1 table registry in shared tombstone-gc state
  tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
  tombstone_gc: unpack per_table_history_maps
  tombstone_gc: extract _group0_gc_time from per_table_history_map
  tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
  test/lib/random_schema: use timeout-mode tombstone_gc
  tombstone_gc_options: add C++ friendly constructor
  test: move away from tombstone_gc_state(nullptr) ctor
  treewide: move away from tombstone_gc_state(nullptr) ctor
  sstable: move away from tombstone_gc_mode::operator bool()
  replica/table: add get_tombstone_gc_state()
  compaction: use tombstone_gc_state with value semantics
  db/row_cache: use tombstone_gc_state with value semantics
  tombstone_gc: introduce tombstone_gc_state::for_tests()
2026-03-05 11:50:31 +01:00
Piotr Dulikowski
a2669e9983 test: test_mv_merge_allowed: add mistakenly omitted awaits
The test test_mv_merge_allowed asserts in two places that the tablet
count is 2. It does so by calling an async function but, mistakenly, the
returned coroutine was not awaited. The coroutine is, apparently, truthy
so the assertions always passed.

Fix the test to properly await the coroutines in the assertions.

Fixes: SCYLLADB-905

Closes scylladb/scylladb#28875
2026-03-05 11:29:23 +01:00
Avi Kivity
5ae40caa6d dist: tune tcp_mem to 3% of total memory in scylla-kernel-conf package
tcp_mem defaults to 9% of total memory. ScyllaDB defaults to 93%. The
sum is more than 100%.

Fix by tuning tcp_mem to 3% of total memory.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-734

Closes scylladb/scylladb#28700
2026-03-05 12:51:04 +03:00
Patryk Jędrzejczak
bb1a798c2c Merge 'raft: Throw stopped_error if server aborted' from Dawid Mędrek
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:

* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
  a null ID will be returned IF forwarding is disabled. This wasn't
  a big problem because forwarding has always been enabled for group0,
  but it's something that's nice to fix. It's might be relevant for
  strong consistency that will heavily rely on this code.

* For `wait_for_leader` and `wait_for_state_change`, the calls may
  hang and never resolve. A more detailed scenario is provided in a
  commit message.

For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.

We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.

Fixes SCYLLADB-841

Backport: Although the hanging problems haven't been spotted so far
          (at least to the best of my knowledge), it's best to avoid
          running into a problem like that, so let's backport the
          changes to all supported versions. They're small enough.

Closes scylladb/scylladb#28822

* https://github.com/scylladb/scylladb:
  raft: Make methods throw stopped_error if server aborted
  raft: Throw stopped_error if server aborted
  test: raft: Introduce get_default_cluster
2026-03-05 10:47:39 +01:00
Botond Dénes
cd13a911cc test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
So that if the check of expected rows fail, we have a dump to look at
and see what is different.
2026-03-05 11:44:02 +02:00
Botond Dénes
f375aae257 replica/database: consolidate the two database_apply error injections
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait

This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
2026-03-05 11:44:02 +02:00
Marcin Maliszkiewicz
c3f59e4fa1 Merge 'cql3: implement write_consistency_levels guardrails' from Andrzej Jackowski
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).

Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.

This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
 mean=   289743.07 standard-deviation=6075.60
 median= 291424.69 median-absolute-deviation=1702.56
 maximum=292498.27 minimum=261920.06
instructions_per_op:
 mean=   48072.30 standard-deviation=21.15
 median= 48074.49 median-absolute-deviation=12.07
 maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
 mean=   18884.09 standard-deviation=56.43
 median= 18877.33 median-absolute-deviation=14.71
 maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
 mean=   289105.08 standard-deviation=3626.58
 median= 290018.90 median-absolute-deviation=1072.25
 maximum=291110.44 minimum=274669.98
instructions_per_op:
 mean=   48117.57 standard-deviation=18.58
 median= 48114.51 median-absolute-deviation=12.08
 maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
 mean=   18953.43 standard-deviation=28.76
 median= 18945.82 median-absolute-deviation=20.84
 maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature

Closes scylladb/scylladb#28570

* github.com:scylladb/scylladb:
  scylla.yaml: add write CL guardrails to scylla.yaml
  scylla.yaml: reorganize guardrails config to be in one place
  test: add cluster tests for write CL guardrails
  test: implement test_guardrail_write_consistency_level
  cql3: start using write CL guardrails
  cql3/query_processor: implement metrics to track CL of writes
  db: cql3/query_processor: add write_consistency_levels enum_sets
  config: add write_consistency_levels_* guardrails configuration
2026-03-05 09:55:38 +01:00
Botond Dénes
44b8cad3df service/storage_proxy: add name of table to error message for write errors
It is useful to know what table the failed write belongs to.
2026-03-05 10:51:12 +02:00
Yauheni Khatsianevich
aa85f5a9c3 test: migrating alternator ttl tests to scylla repo
migrating alternator_ttl_tests.py to scylla repo as part of
deprecating dtest framework
migrated tests:
- test_ttl_with_load_and_decommission

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-869

Closes scylladb/scylladb#28858
2026-03-05 10:04:14 +02:00
Nadav Har'El
8e32d97be6 test/alternator: fix run script
The test/alternator/run script currently fails, Scylla fails to boot
complaining that "--alternator-ttl-period-in-seconds" is specified
twice (which is, unfortunately, not allowed). The problem is that
recently we started to set this option in test/cqlpy/run.py, for
CQL's new row-level TTL, so now it is no longer needed in
test/alternator/run - and in fact not allowed and we must remove it.

This patch only affects the script test/alternator/run, and has no
affect on running tests through test.py or Jenkins.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28868
2026-03-05 10:06:38 +03:00
Botond Dénes
9b2242c752 test/cluster/test_repair.py: fix test_repair_timtestamp_difference
This test forgot to await its check() calls, which is the pass-condition
of the test. Once the await was added, the test started failing. Turns
out, the test was broken, but this was never discovered, because due to
the missing await, the errors were not propagated.
This patch adds the missing await and fixes the discovered problems:
* Use cql.run_async() instead of cql.execute()
* Fix json path for timestamp
* Missing flush/compact

Fixes: SCYLLADB-911

Closes scylladb/scylladb#28883
2026-03-05 10:04:49 +03:00
Nadav Har'El
af07718fff test/cqlpy: fix "run --release" for versions 5.4 or older
Recently we started to rely on the options "--auth-superuser-name"
and "--auth-superuser-salted-password" to ensure that a
cassandra/cassandra user exists for tests - without those options
a default superuser no longer exists.

This broke "test/cqlpy/run --release" for old releases, earlier
than 5.4 (in the enterprise stream, 2024.1 or earlier), because
those old release didn't have this option.

So in this patch we fix the "--release" logic that removes these
options from the command line when running these old versions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28894
2026-03-05 09:59:46 +03:00
Botond Dénes
5e7b966d37 Merge 'Remove prepare_snapshot_for_backup() helper from backup/restore tests' from Pavel Emelyanov
The helper in question duplicates the functionality of `take_snapshot()` one from the same file. The only difference is that it additionally creates keyspace:table with yet another helper, but that helper is also going to be removed (as continuation of #28600 and #28608)

Enhancing tests, not backporting

Closes scylladb/scylladb#28834

* github.com:scylladb/scylladb:
  test_backup: Remove prepare_snapshot_for_backup()
  test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
  test_backup: Patch backup tests to use take_snapshot()
  test_backup: Add helper to take snapshot on a single server
2026-03-05 06:54:07 +02:00
Dani Tweig
25fc8ef14c Add RELENG to milestone-to-Jira sync project keys
Closes scylladb/scylladb#28889
2026-03-05 06:51:21 +02:00
Calle Wilund
35aab75256 test_internode_compression: Add await for "run" coro:s
Fixes: SCYLLADB-907

Closes scylladb/scylladb#28885
2026-03-05 06:50:33 +02:00
Patryk Jędrzejczak
2a3476094e storage_service: raft_topology_cmd_handler: fix use-after-free
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```

This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.

No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.

Closes scylladb/scylladb#28772
2026-03-05 01:50:22 +01:00
Aleksandra Martyniuk
156c29f962 test: add test_group0_tables_use_schema_commitlog 2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
5306e26b83 db: service: remove group0 tables from schema commitlog schema initializer
Remove group0 tables from schema commitlog schema initializer.
The schema commitlog of group0 tables is ensured by set_is_group0_table.
2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
690b2c4142 service: ensure that tables updated via group0 use schema commitlog
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
2026-03-04 17:25:04 +01:00
Aleksandra Martyniuk
6b3b174704 db: schema: remove set_is_group0_table param
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.

Drop the param. For not group0 tables nothing should be set.
2026-03-04 17:24:34 +01:00
Aleksandra Martyniuk
57f1e46204 test: cluster: tasks: await set_task_ttl
Await set_task_ttl in test_tablet_repair_task_children.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-912

Closes scylladb/scylladb#28882
2026-03-04 17:58:37 +02:00
Dawid Mędrek
d44fc00c4c raft: Make methods throw stopped_error if server aborted
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.

I found two methods that do not:

(A) `wait_for_leader`
(B) `wait_for_state_change`

What happened before these changes?

In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.

Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_stte_change_promise`.

There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.

We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:

* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.

Fixes SCYLLADB-841
2026-03-04 16:28:11 +01:00
Dawid Mędrek
c200d6ab4f raft: Throw stopped_error if server aborted
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.

`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.

We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.

Refs SCYLLADB-841
2026-03-04 16:28:08 +01:00
Marcin Maliszkiewicz
9697b6013f Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)

Backport to 2026.1, as the test is also there.

[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28877

* github.com:scylladb/scylladb:
  test: increase timeouts in test_client_routes.py
  test: add missing awaits in test_client_routes_upgrade
2026-03-04 15:26:34 +01:00
Szymon Malewski
212bd6ae1a vector: Vectorize loops in similarity functions
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed with adding `#pragma clang fp contract(fast) reassociate(on)`, which allows compiler to optimize floating point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally using float operations instead of double increases throughput and numerical accuracy is not the main consideration in vector search scenarios.

Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS

Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764

Closes scylladb/scylladb#28754
2026-03-04 15:14:53 +01:00
Andrzej Jackowski
221b78cb81 test: increase timeouts in test_client_routes.py
Increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Test execution time is unaffected; the entire suite in `dev` takes ~30s
before and after this change.
2026-03-04 13:40:30 +01:00
Andrzej Jackowski
527c4141da test: add missing awaits in test_client_routes_upgrade
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Fixes: SCYLLADB-909
2026-03-04 13:34:37 +01:00
Piotr Smaron
a31cb18324 db: fix UB in system.clients row sorting
The comparator used to sort per-IP client rows was not a strict-weak-ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients):
a = (port=9042, client_type="cql")
b = (port=10000, client_type="alternator")
With the current comparator:
cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true
cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true
So both directions are true, meaning there is no valid ordering that sort can achieve.

The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering.

Closes scylladb/scylladb#28844
2026-03-04 14:10:49 +03:00
Avi Kivity
c331796d28 Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.

While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.

The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.

After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2 Precision minus log2 Min or zero whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.

Additional tests cover the new behavior.
Relates to #2785

** New feature, no need to backport. **

Closes scylladb/scylladb#28371

* github.com:scylladb/scylladb:
  estimated_histogram_test.cc: add to_metrics_histogram test
  histogram_metrics_helper.hh: Support Min < Precision
  estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
  estimated_histogram.hh: support Min less than Precision histograms
2026-03-04 12:43:26 +02:00
Szymon Malewski
4c4673e8f9 test: vector_similarity: Fix similarity value checks
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877

Closes scylladb/scylladb#28748
2026-03-04 09:53:32 +01:00
Marcin Maliszkiewicz
c7d3f80863 Merge 'auth: do not create default 'cassandra:cassandra' superuser' from Dario Mirovic
This patch series removes creation of default 'cassandra:cassandra' superuser on system start.

Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default `cassandra:cassandra' role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client to either use config to specify default values for default superuser name and password or use cqlsh over maintenance socket connection to explicitly create/alter a superuser role.

The patch series:
- Enable role modification over the maintenance socket
- Stop using default 'cassandra' value for default superuser, skipping creation instead

Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuser

Fixes scylladb/scylla-enterprise#5657

This is an improvement. It does not need a backport.

Closes scylladb/scylladb#27215

* github.com:scylladb/scylladb:
  config: enable maintenance socket in workdir by default
  docs: auth: do not specify password with -p option
  docs: update documentation related to default superuser
  test: maintenance socket role management
  test: cluster: add logs to test_maintenance_socket.py
  test: pylib: fix connect_driver handling when adding and starting server
  auth: do not create default 'cassandra:cassandra' superuser
  auth: remove redundant DEFAULT_USER_NAME from password authenticator
  auth: enable role management operations via maintenance socket
  client_state: add has_superuser method
  client_state: add _bypass_auth_checks flag
  auth: let maintenance_socket_role_manager know if node is in maintenance mode
  auth: remove class registrator usage
  auth: instantiate auth service with factory functors
  auth: add service constructor with factory functors
  auth: add transitional.hh file
  service: qos: handle special scheduling group case for maintenance socket
  service: qos: use _auth_integration as condition for using _auth_integration
2026-03-04 09:43:57 +01:00
Piotr Dulikowski
85dcbfae9a Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.

To be sure, add a debugging check for current group being the expected one.

Code cleanup, not backporting

Closes scylladb/scylladb#28545

* github.com:scylladb/scylladb:
  hint: Don't switch group in database::apply_hint()
  hint_sender: Switch to sender group on stop either
2026-03-04 09:36:38 +01:00
Pavel Emelyanov
5793e305b5 test_backup: Remove prepare_snapshot_for_backup()
It's now unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ffbd9a3218 test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
This change is a bit more careful, as the test collects files from
snapshot directory several times. Before patching it to use the helper,
it collected _all_ the files. Now the helper only provides TOC-s, but
that's fine -- the only check that relies on that may also re-collect
TOC-s and compare new set with old set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
c1b0ac141b test_backup: Patch backup tests to use take_snapshot()
Some of those tests need to update the hard-coded 'backup' snapshot name
to use the one provided by take_snapshot() helper. Other than that, the
patching is pretty straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ea17c26fd9 test_backup: Add helper to take snapshot on a single server
The take_snapshot() helper returns a dict(server: list[string]). When
there's only one server to work with, it's more handy to just get a
single list of sstables.

Next patches will make use of that helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:31:39 +03:00
Botond Dénes
e7487c21e4 test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1 2026-03-04 09:45:38 +02:00
Botond Dénes
5998a859f7 tombstone_gc: allow use of repair-mode for RF=1 tables
Modify the methods which calculate the default gc mode as well as that
which validates whether repair-mode can be used at all, so both accepts
use of repair-mode on RF=1 tables.

This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.

Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
  cases because this patch makes tombstones they write subject to GC
  when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
  repair-mode as a non-default for the base table and expected the MV to
  revert to default. Another mode has to be used as the non-default
  (immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
  compare tombstone_gc schema extension when comparing dumped schema vs.
  original. The tool's schema loader doesn't have access to the keyspace
  definition so it will come up with different defaults for
  tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
  sets tombstone expiry assuming the tombstone-gc timeout-mode default.
  Change the CREATE TABLE statement to set the expected mode.
2026-03-04 09:44:24 +02:00
Andrzej Jackowski
c0e94828de scylla.yaml: add write CL guardrails to scylla.yaml
Disabled by default. This change is introduced only to document the
guardrail.

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
038f89ede4 scylla.yaml: reorganize guardrails config to be in one place
Also change the format of the section header and add "#" to empty
lines, so that in the future no one splits the section by adding new
configs.
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
ec42fdfd01 test: add cluster tests for write CL guardrails
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:

- `test_invalid_write_cl_guardrail_config` checks the node startup
  path when incorrect `write_consistency_levels_warned` and
  `write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
  configuration using a multi-node cluster.

Tests execution time:
 - Dev: 10s
 - Debug: 18s

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
446539f12f test: implement test_guardrail_write_consistency_level
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).

All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).

Tests execution time:
 - Dev: 3s
 - Debug: 14s

Refs: SCYLLADB-259
Refs: SCYLLADB-882
2026-03-04 08:00:13 +01:00
Avi Kivity
85bd6d0114 Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.

To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.

This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
2026-03-04 08:55:43 +02:00
Piotr Dulikowski
2fb981413a Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

Closes scylladb/scylladb#28846

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness
2026-03-04 08:55:43 +02:00
Wojciech Mitros
38f02b8d76 mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.

It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

Closes scylladb/scylladb#28838
2026-03-04 08:55:43 +02:00
Geoff Montee
0eb5603ebd Docs: describe the system tables
Fixes issue #12818 with the following docs changes:

docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories

Closes scylladb/scylladb#27789
2026-03-04 08:55:43 +02:00
Botond Dénes
f156bcddab Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795

backport: 2026.1, already on other stable branches

Closes scylladb/scylladb#28848

* github.com:scylladb/scylladb:
  test: add more logs to test_startup_no_auth_response
  test: decrease strain in test_startup_response
2026-03-04 08:55:43 +02:00
Andrzej Jackowski
bb359b3b78 cql3: start using write CL guardrails
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.

Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.

This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
	mean=   289743.07 standard-deviation=6075.60
	median= 291424.69 median-absolute-deviation=1702.56
	maximum=292498.27 minimum=261920.06
instructions_per_op:
	mean=   48072.30 standard-deviation=21.15
	median= 48074.49 median-absolute-deviation=12.07
	maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
	mean=   18884.09 standard-deviation=56.43
	median= 18877.33 median-absolute-deviation=14.71
	maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
	mean=   289105.08 standard-deviation=3626.58
	median= 290018.90 median-absolute-deviation=1072.25
	maximum=291110.44 minimum=274669.98
instructions_per_op:
	mean=   48117.57 standard-deviation=18.58
	median= 48114.51 median-absolute-deviation=12.08
	maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
	mean=   18953.43 standard-deviation=28.76
	median= 18945.82 median-absolute-deviation=20.84
	maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
2026-03-04 07:26:00 +01:00
Asias He
225b10b683 repair: Fix rwlock in compaction_state and lock holder lifecycle
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:15 -03:00
Raphael S. Carvalho
1d8903d9f7 repair: Prevent repair lock holder leakage after table drop
Prevent repair lock holder from being leaked in repair_service when table
is dropped midway.
The leakage might result in use-after-free later, since the repair lock
itself will be gone after table drop.
The RPC verb that removes the lock on success path will not be called
by coordinator after table was dropped.

Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:10 -03:00
Dario Mirovic
06af4480ea config: enable maintenance socket in workdir by default
We want to enable maintenance socket by default.
This will prevent users from having to reboot a server to enable it.
Also, there is little point in having maintenance socket that is turned off,
and we want users to use it. After this patch series, they will have
to use it. Note that while config seeding exists, we do not encourage it
for production deployments.

This patch changes default maintenance_socket value from ignore to workdir.
This enables maintenance socket without specifying an explicit path.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
6e83fb5029 docs: auth: do not specify password with -p option
Specifying password with -p option is considered unsafe.
The password will be saved in bash history.
The preferred approach is to enter the password when prompted.
Any approach that passes the password via command line arguments
makes that password visible in process options (ps command), no matter
if the password is passed directly or as an environment variable.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
afafb8a8fa docs: update documentation related to default superuser
Update create superuser procedure:
- Remove notes about default `cassandra` superuser
- Add create superuser using existing superuser section
- Update create superuser by using `scylla.yaml` config
- Add create superuser using maintenance socket

Update password reset procedure:
- Add maintenance socket approach
- Remove the old approach with deleting all the roles

Update enabling authentication with downtime and during runtime:
- Mention creating new superuser over the maintenance socket
- Remove default superuser usage

Update enable authorization:
- Mention creating new superuser over the maintenance socket
- Remove mention of default superuser

Reasoning for deletion of the old approach:
- [old] Needs cluster downtime, removes all roles, needs recreation of roles,
  needs maintenance socket anyways, if config values are not used for superuser
- [new] No cluster downtime, possibly one node restart to enable maintenance
  socket, faster

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
3db74aaf5f test: maintenance socket role management
Introduce a test that cover:
- Server startup without credentials config seeding with no roles created
- Await maintenance socket role management to be enabled
- `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success

All the tests in the test_maintenance_socket.py module take 2-3 seconds
to execute.

Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot
schedule new futures after shutdown'.

Refs SCYLLADB-409
2026-03-03 23:57:50 +01:00
Dario Mirovic
f74fe22386 test: cluster: add logs to test_maintenance_socket.py
Add logs to test_maintenance_socket.py test test_maintenance_socket.
This approach offers additional visibility in case of test failure.
Such logs will be added to new tests in a follow up patch in this
patch series.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
0e5ddec2a8 test: pylib: fix connect_driver handling when adding and starting server
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.

This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).

This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
fd17dcbec8 auth: do not create default 'cassandra:cassandra' superuser
Changes the behavior of default superuser creation.
Previously, without configuration 'cassandra:cassandra' credentials
were used. Now default superuser creation is skipped if not configured.

The two ways to create default superuser are:
- Config file - auth_superuser_name and auth_superuser_salted_password fields
- Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ...

Behavior changes:

Old behavior:
- No config - 'cassandra:cassandra' created
- auth_superuser_name only - <name>:cassandra created
- auth_superuser_salted_password only - 'cassandra:<password>' created
- Both specified - '<name>:<password>' created

New behavior:
- No config - no default superuser
    - Requires maintenance socket setup
- auth_superuser_name only - '<name>:' created WITHOUT password
    - Requires maintenance socket setup
- auth_superuser_salted_password only - no default superuser
- Both specified - '<name>:<password>' created

Fixes SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
9dc1deccf3 auth: remove redundant DEFAULT_USER_NAME from password authenticator
Remove redundant DEFAULT_USER_NAME from password_authenticator.cc file.
It is just a copy of meta::DEFAULT_SUPERUSER_NAME.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
45628cf041 auth: enable role management operations via maintenance socket
Introduce maintenance_socket_authenticator and rework
maintenance_socket_role_manager to support role management operations.

Maintenance auth service uses allow_all_authenticator. To allow
role modification statements over the maintenance socket connections,
we need to treat the maintenance socket connections as superusers and
give them proper access rights.

Possible approaches are:
1. Modify allow_all_authenticator with conditional logic that
   password_authenticator already does
2. Modify password_authenticator with conditional logic specific
   for the maintenance socket connections
3. Extend password_authenticator, overriding the methods that differ

Option 3 is chosen: maintenance_socket_authenticator extends
password_authenticator with authentication disabled.

The maintenance_socket_role_manager is reworked to lazily create a
standard_role_manager once the node joins the cluster, delegating role
operations to it. In maintenance mode role operations remain disabled.

Refs SCYLLADB-409
2026-03-03 23:41:05 +01:00
Dario Mirovic
6a1edab2ac client_state: add has_superuser method
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.

Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
d765b5b309 client_state: add _bypass_auth_checks flag
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.

Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.

This will allow maintenance socket connections to skip
authorization without being marked as internal.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
b68656b59f auth: let maintenance_socket_role_manager know if node is in maintenance mode
This patch is part of preparations for dropping 'cassandra::cassandra'
default superuser. When that is implemented, maintenance_socket_role_manager
will have two modes of work:
1. in maintenance mode, where role operations are forbidden
2. in normal mode, where role operations are allowed

To execute the role operations, the node has to join a cluster.
In maintenance mode the node does not join a cluster.

This patch lets maintenance_socket_role_manager know if it works under
maintenance mode and returns appropriate error message when role
operations execution is requested.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
3bef493a35 auth: remove class registrator usage
This patch removes class registrator usage in auth module.
It is not used after switching to factory functor initialization
of auth service.

Several role manager, authenticator, and authorizer name variables
are returned as well, and hardcoded inside qualified_java_name method,
since that is the only place they are ever used.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
eab24ff3b0 auth: instantiate auth service with factory functors
Auth service is instantiated with the constructor that accepts
service_config, which then uses class registrator to instantiate
authorizer, authenticator, and role manager.

This patch switches to instantiating auth service via the constructor
that accepts factory functors. This is a step towards removing
usage of class registrator.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
bfff07eacb auth: add service constructor with factory functors
Auth service can be initialized:
- [current] by passing instantiated authorizer, authenticator, role manager
- [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager
    - This approach is easy to use with sharded services
- [new] by passing factory functors which instantiate authorizer, authenticator, role manager
    - This approach is also easy to use with sharded services

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e8e00c874b auth: add transitional.hh file
In a follow-up patch in this patch series class registrator will be removed.
Adding transitional.hh file will be necessary to expose the authenticator and authorizer.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e5218157de service: qos: handle special scheduling group case for maintenance socket
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.

The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.

The guard is already present, but it was omitted in get_cached_user_scheduling_group.

This also fixes flakiness in test_maintenance_socket.py tests.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
dc9a90d7cb service: qos: use _auth_integration as condition for using _auth_integration
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced with scylladb/scylladb#26856 PR check for
the value of user variable. For maintenance socket connections it will be an
anonymous user, and will fall back to using default scheduling group.

This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Andrzej Jackowski
371cdb3c81 cql3/query_processor: implement metrics to track CL of writes
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.

Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.

Refs: SCYLLADB-259
2026-03-03 21:18:11 +01:00
Andrzej Jackowski
3606934458 db: cql3/query_processor: add write_consistency_levels enum_sets
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.

Refs: SCYLLADB-259
2026-03-03 20:28:57 +01:00
Dawid Mędrek
7fd083e329 test: raft: Introduce get_default_cluster
We introduce a function creating a Raft cluster with parameters usually
used by the tests. This will avoid code duplication, especially after
introducing new tests in the following commits.

Note that the test `test_aborting_wait_for_state_change` has changed:
the previous complex configuration was unnecessary for it (I wrote it).
2026-03-03 18:50:21 +01:00
Pavel Emelyanov
b768753c0f test: Remove passing default "expected_replicas" to check_mutation_replicas()
The value of None is default, callers don't need to specify it
explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:28:21 +03:00
Pavel Emelyanov
b8ae9ede63 test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
These two are only used to print into logs on error. However, their
values can be found from previous logs and test execution context.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:26:25 +03:00
Karol Nowacki
45477d9c6b vector_search: test: include ANN error in assertion
When the test fails, the assertion message does not include
the error from the ANN request.

This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
2026-03-03 14:19:20 +01:00
Karol Nowacki
ab6c222fc4 vector_search: test: fix HTTPS client test flakiness
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802
2026-03-03 14:19:20 +01:00
Botond Dénes
4f5310bc72 replica/table: update rf=1 table registry in shared tombstone-gc state
On every update of the ERM, update the state of the current table in the
registry of RF=1 tables in shared tombstone gc state.
Ensures that tombstone gc stops collection of tombstones in immediate
mode as soon as the table starts transitioning away from RF=1.
2026-03-03 14:09:28 +02:00
Botond Dénes
7c2c63ab43 tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
Currently we cannot use repair-mode tombstone gc on RF=1 tables, because
such tables don't need repair and so there won't be repair history to
use to produce gc_before times.
Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a
registry of RF=1 tables. Keeping this up to date is left to outside code
(table.cc). Consult the registry to determine whether a table is RF=1 or
not, so the repair history check can be ellided for rf=1 tables.
Not wired in yet into the table code.
2026-03-03 14:09:28 +02:00
Botond Dénes
074006749c tombstone_gc: unpack per_table_history_maps
It is now a class with a single member, replace usage with that of the
member (through an alias to reduce churn).
2026-03-03 14:09:28 +02:00
Botond Dénes
d6e2d44759 tombstone_gc: extract _group0_gc_time from per_table_history_map
Doesn't belong there. Also, having it as a separate member of
shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the
update doesn't have to do a copy-mutate-swap of the history maps.
2026-03-03 14:09:28 +02:00
Botond Dénes
5fd9fc3056 tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
Both are ambiguous and all users were migrated away to more meaningful
alternatives. They are now unused, drop them.
2026-03-03 14:09:28 +02:00
Botond Dénes
a785c0cf41 test/lib/random_schema: use timeout-mode tombstone_gc
This is the current de-facto default for all tests using random schema
and some are apparently relying on this. Make this explicit to avoid
upsetting tests, by the impending change of this default to repair.
2026-03-03 14:09:28 +02:00
Botond Dénes
d10e622a3b tombstone_gc_options: add C++ friendly constructor
So one can create options with strong types, instead of from a map of
strings.
2026-03-03 14:09:28 +02:00
Botond Dénes
6004e84f18 test: move away from tombstone_gc_state(nullptr) ctor
Use for_tests() instead (or no_gc() where approriate).
2026-03-03 14:09:28 +02:00
Botond Dénes
3c34598d88 treewide: move away from tombstone_gc_state(nullptr) ctor
It is ambigous, use the appropriate no-gc or gc-all factories instead,
as appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.

Tests are handled in the next commit, to reduce noise.

Two tests in mutation_test.cc have to be updated:

* test_compactor_range_tombstone_spanning_many_pages
  has to be updated in this commit, as it uses
  mutation_partition::compact_for_query() as well as
  compact_for_query(). The test passes default constructed
  tombstone_gc() to the latter while the former now uses no-gc
  creating a mismatch in tombstone gc behaviour, resulting in test
  failure. Update the test to also pass no-gc to compact_for_query().

* test_query_digest similarly uses mutation_partition::query_mutation()
  and another compaction method, having to match the no-gc now used in
  query_mutation().
2026-03-03 14:09:28 +02:00
Botond Dénes
04b001daa6 sstable: move away from tombstone_gc_mode::operator bool()
It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead.
Note that the two has slightly different meanings, operator bool()
returned true when repair-history related functionality was enabled.
This is fine, because the only two users are logs, where the two
meanings are close enough. All other users were eliminated or migrated
already, taking the change in meaning into account.
2026-03-03 14:09:28 +02:00
Botond Dénes
6364e35403 replica/table: add get_tombstone_gc_state()
Shorthand for
get_compaction_manager().get_shared_tombstone_gc_state().get_tombstone_gc_state().
2026-03-03 14:09:28 +02:00
Botond Dénes
f3ee6a0bd1 compaction: use tombstone_gc_state with value semantics
Instead of passing around references to it, pass around values.

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
83e20d920e db/row_cache: use tombstone_gc_state with value semantics
Instead of keeping a pointer to it. Replace nullptr with
tombstone_gc_state::no_gc().

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
041ab593c7 tombstone_gc: introduce tombstone_gc_state::for_tests()
To replace the usage of tombstone_gc_state(nullptr) usage in tests
specifically. This more verbose factory method will hopefully convey
that this is not to be used in production code. The nullptr constructor
doesn't convey this and in fact it was used in production code
here-and-there.
2026-03-03 14:09:27 +02:00
Artsiom Mishuta
5c84a76b28 test.py: setup pytest logger
This commit introduces pure pytest logging into a file

Previously, test.py used pytest as a script(not a framework) and just captured pytest stdout and logged this data by itself

This commit sets up the log files format that additionaly display Python processName, threadName adn taskName because test.py test cases use them, and now it is so hard to investigate issues that are connected with parallelism inside test case themselve

In addition, commit splits the logging of different pytest workers(xdist) into different files. If pytest workers have ho failed test - log file for these workers will be deleted

There is also additional logging for failures that will contain a separate file per test failure and contain the error itself (stacktrace) and all capture logs from stdout, stderr during the test run. With --save-log-on-success it will be a separate file per test on pass as well

All this new functionality works with the new xdit scheduler (--test-py-init=True)

Fixes SCYLLADB-713

Closes scylladb/scylladb#28705
2026-03-03 11:49:01 +01:00
Dimitrios Symonidis
80b74d7df2 tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.

Closes scylladb/scylladb#28450
2026-03-03 11:19:24 +03:00
Calle Wilund
69f8e722bf table::snapshot_table_on_all_shards: Use set to keep track of tablets in manifest
Fixes: SCYLLADB-828

Avoid iterating linear set of tablets when building manifest. Reduces complexity.

Closes scylladb/scylladb#28851
2026-03-03 08:09:33 +02:00
Karol Nowacki
30487e8854 index: fix vector index with filtering target column
The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```

This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````

This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.

Fixes: SCYLLADB-635

Closes scylladb/scylladb#28740
2026-03-02 18:47:58 +02:00
Sergey Zolotukhin
33923578eb Update DROP INDEX statement documentation
Clarify behavior of DROP INDEX during ongoing builds.

Closes scylladb/scylladb#28659
2026-03-02 17:31:23 +02:00
Botond Dénes
ab532882db tools/scylla-sstable: introduce scylla sstable split
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.

Fixes: SCYLLADB-10

Closes scylladb/scylladb#28741
2026-03-02 15:19:17 +01:00
Marcin Maliszkiewicz
d95939d69a test: add more logs to test_startup_no_auth_response
When test fails with assert connections_observed
we would like to know if it was unable to connect
or execute query in attempt_good_connection
2026-03-02 14:53:46 +01:00
Marcin Maliszkiewicz
91126eb2fb test: decrease strain in test_startup_response
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)
2026-03-02 14:46:51 +01:00
Botond Dénes
bf3edaf220 tools/scylla-sstable: filter_operation(): use deferred_close() to close reader
Manual closing is bypassed with exceptions, promoting an exception to a
crash due to unclosed reader.

Closes scylladb/scylladb#28797
2026-03-02 14:16:08 +01:00
Marcin Maliszkiewicz
6bf706ef1b Merge 'scylla-sstable: query: handle nested UDTs' from Botond Dénes
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.

Backport: Implements missing functionality of a tool, no backport.

Closes scylladb/scylladb#28798

* github.com:scylladb/scylladb:
  tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
  tools/scylla-sstable: generalize dump_if_user_type
  tools/scylla-sstable: move dump_if_user_type() definition
2026-03-02 14:14:43 +01:00
dependabot[bot]
f5fa77ac9d build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.4 to 0.3.7.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#28833
2026-03-02 14:13:03 +01:00
Karol Nowacki
647172d4b8 vector_search: fix names of private members
According to coding style in Scylla,
member variables are prefixed with underscore.
2026-03-02 14:08:16 +01:00
Karol Nowacki
f2308b000f vector_search: remove unused global variable 2026-03-02 14:08:07 +01:00
Marcin Maliszkiewicz
a83ee6cf66 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature
2026-03-02 12:09:10 +01:00
Patryk Jędrzejczak
ba7f314cdc test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```

Fixes SCYLLADB-805

Closes scylladb/scylladb#28829
2026-03-02 10:26:57 +02:00
Yaron Kaikov
ab02486ce8 .github/workflows/trigger-scylla-ci.yaml: fix org membership check in trigger-scylla-ci workflow
Following becb48b586 it seems we have a regression with trigger CI logic
The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR
with GITHUB_TOKEN to check if the user is an org member. However,
GITHUB_TOKEN does not have read:org scope, so the API call fails for all
users — even actual scylladb org members — causing CI triggers to be
silently skipped.

Replace the API call with the author_association field from the GitHub
event payload, which is set by GitHub itself and requires no special
token permissions. This allows any scylladb org member (MEMBER or OWNER)
to trigger CI via comment, regardless of whether they authored the PR.

Closes scylladb/scylladb#28837
2026-03-02 10:23:14 +02:00
Michael Litvak
8c4bc33e51 test: remove test_view_building_with_tablet_move
remove the test since it's not relevant anymore, it's not testing what
it's supposed to test and it's unstable.

the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraparounds the token ring.

this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.

besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.

Fixes SCYLLADB-842

Closes scylladb/scylladb#28842
2026-03-02 07:42:08 +01:00
Marcin Maliszkiewicz
8c2da76fde test/cqlpy: remove xfail from test_constant_function_parameter
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.

Closes scylladb/scylladb#28776
2026-03-01 20:03:42 +02:00
Jenkins Promoter
fb6eebc383 Update pgo profiles - aarch64 2026-03-01 05:17:44 +02:00
Jenkins Promoter
8edd532c05 Update pgo profiles - x86_64 2026-03-01 04:31:57 +02:00
Botond Dénes
1f09fcfb26 Merge 'Use standard ks/cf/data creation methods in test_restore_with_streaming_scopes' from Pavel Emelyanov
The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities.

Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility.

Continuation of #28600.

Cleaning tests, not backporting

Closes scylladb/scylladb#28608

* github.com:scylladb/scylladb:
  test/object_store: Use itertools.product() for deeply nested loops
  test/object_store: Replace dataset creation usage with standard methods
  test/object_store: Shift indentation right for test_restore_with_streaming_scopes
2026-02-27 16:15:55 +02:00
Avi Kivity
450a09b152 test: tools: restrict embedded perf tests from taking over host
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.

All restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.

Closes scylladb/scylladb#28821
2026-02-27 16:06:22 +02:00
Botond Dénes
d3a3921487 Merge 'Re-use and improve the take_snapshot() helper in backup tests' from Pavel Emelyanov
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit.

Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later.

Cleaning up tests, not backporting.

Closes scylladb/scylladb#28660

* github.com:scylladb/scylladb:
  test/backup: Run keyspace flush and snapshot taking API in parallel
  test/backup: Re-use take_snapshot() helper in do_abort_restore()
  test/backup: Move take_snapshot() helper up
2026-02-27 15:58:18 +02:00
Łukasz Paszkowski
fb40d1e411 compaction_manager: improve readability of maybe_wait_for_sstable_count_reduction()
Split the chained inject_parameter().value_or() expression into two
separate named variables for clarity.

Use condition_variable::when() instead of wait(). when() is the
coroutine-native API, designed specifically for co_await contexts —
it avoids the heap allocation of a promise_waiter that wait() uses,
and instead uses a stack-based awaiter.

Closes scylladb/scylladb#28824
2026-02-27 15:51:30 +02:00
Patryk Jędrzejczak
9a9202c909 Merge 'Remove gossiper topology code' from Gleb Natapov
The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail.

Refs #15422

No backport needed since this removes functionality.

Closes scylladb/scylladb#28514

* https://github.com/scylladb/scylladb:
  group0: fix indentation after previous patch
  raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
  raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
  raft_group0: remove unused code from raft_group0
  node_ops: remove topology over node ops code
  topology: fix indentation after the previous patch
  topology: drop topology_change_enabled parameter from raft_group0 code
  storage_service: remove unused handle_state_* functions
  gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
  storage_service: fix indentation after the last patch
  storage_service: remove gossiper bootstrapping code
  storage_service: drop get_group_server_if_raft_topolgy_enabled
  storage_service: drop is_topology_coordinator_enabled and its uses
  storage_service: drop run_with_api_lock_in_gossiper_mode_only
  topology: remove code that assumes raft_topology_change_enabled() may return false
  test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
  test: schema_change_test: drop schema tests relevant for no raft mode only
  topology: remove upgrade to raft topology code
  group0: remove upgrade to group0 code
  group0: refuse to boot if a cluster is still is not in a raft topology mode
  storage_service: refuse to join a cluster in legacy mode
2026-02-27 14:43:41 +01:00
Anna Stuchlik
dfd46ad3fb doc: add the upgrade guide from 2025.x to 2026.1
This commit adds the upgrade guide for version 2026.1.

According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.

In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.

Fixes https://github.com/scylladb/scylladb/issues/28533

Fixes  https://github.com/scylladb/scylladb/issues/28532

Closes scylladb/scylladb#28789
2026-02-27 15:36:34 +02:00
Botond Dénes
fcc570c697 Merge 'Exorcise assertions from Alternator, using a new throwing_assert() macro' from Nadav Har'El
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.

In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().

Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.

There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.

Fixes #28308.

Closes scylladb/scylladb#28445

* github.com:scylladb/scylladb:
  alternator: replace SCYLLA_ASSERT with throwing_assert
  utils: introduce throwing_assert(), a safe replacement for assert
2026-02-27 15:35:36 +02:00
Roy Dahan
822c1597c9 install.sh: fix REST API paths for nonroot installations
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.

Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)

Fixes: SCYLLADB-721

Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.

Closes scylladb/scylladb#28805
2026-02-27 15:32:54 +02:00
Botond Dénes
9521a51e4c Merge 'generic_server: scale connection concurrency semaphore by listener count' from Marcin Maliszkiewicz
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.

Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.

It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem

Closes scylladb/scylladb#28747

* github.com:scylladb/scylladb:
  test: add test_uninitialized_conns_semaphore
  generic_server: fix waiters count in shed log
  generic_server: scale connection concurrency semaphore by listener count
2026-02-27 15:06:50 +02:00
Taras Veretilnyk
5bbc44ed12 sstables: replace rewrite_statistics with new rewrite component mechanism
This commits migrates all callers that used rewrite_statistics to new
rewrite component mechanism.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
51c345aaf6 sstables: add new rewrite component mechanism for safe sstable component rewriting
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup.

However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest—impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it.
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
4aa0a3acf9 compaction: add compaction_group_view method to specify sstable version
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
16ea7a8c1c sstables: add null_data_sink and serialized_checksum for checksum-only calculation
Introduce a null_data_sink and make_digest_calculator implementation that discards
all writes, enabling checksum calculation without file I/O. This allows the
existing checksummed_file_writer to be used for digest computation
without writing data to disk.

This will be used in a future commit to calculate the scylla metadata
component checksum before writing it to disk, allowing the component
to store its own checksum.
2026-02-26 22:38:51 +01:00
Łukasz Paszkowski
bb57b0f3b7 compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
    co_await cstate.compaction_done.wait([..] {
        return num_runs_for_compaction() <= threshold
            || !can_perform_regular_compaction(t);
    });
```
to a while loop with a predicated wait:
```
    while (can_perform_regular_compaction(t)
           && co_await num_runs_for_compaction() > threshold) {
        co_await cstate.compaction_done.wait([this, &t] {
            return !can_perform_regular_compaction(t);
        });
    }
```

This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).

However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.

This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.

Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610

Closes scylladb/scylladb#28801
2026-02-26 20:13:50 +02:00
Marcin Maliszkiewicz
a03ebe1a29 Merge 'cql: implement a new per-row TTL feature' from Nadav Har'El
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.

The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.

In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:

```cql
CREATE TABLE tab (
    id int PRIMARY KEY,
    t text,
    expiration timestamp TTL
);
```

Or after the table already exists with:

```cql
ALTER TABLE tab TTL expiration
```

Expiration can also be disabled, with:

```cql
ALTER TABLE tab TTL NULL
```

The new per-row TTL feature has two features that users have been asking for:

1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.

To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.

The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.

The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`

The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.

This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.

Fixes #13000

Closes scylladb/scylladb#28320

* github.com:scylladb/scylladb:
  test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
  docs/cql: document the new CQL per-row TTL feature
  test/cluster: tests for the new CQL per-row TTL feature
  test/cqlpy: tests for the new CQL per-row TTL feature
  test: set low alternator_ttl_period_in_seconds in CQL tests
  cql ttl: fix ALTER TABLE to disable TTL if column is dropped
  cql ttl: add setting/unsetting of TTL column to ALTER TABLE
  cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
  ttl: add CQL support to Alternator's TTL expiration service
  alternator ttl: move TTL_TAG_KEY to a header file
  alternator ttl: remove unnecessary check of feature flag
  cql: add "cql_row_ttl" cluster feature
  alternator: fix error message if UpdateTimeToLive is not supported
2026-02-26 15:29:12 +01:00
Ferenc Szili
f22d75a57e load_stats: add filtering for tablet sizes
This patch adds filtering for tablet sizes collected in load_stats.
This is needed to improve the chances that the balancer will have all
the tablet sizes for the node, and that way avoid having the node
ignored during balancing.
2026-02-26 15:17:39 +01:00
Marcin Maliszkiewicz
4d0f1bf5c9 conf: improve rf_rack_valid_keyspaces documentation is scylla.yaml
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-761

Closes scylladb/scylladb#28738
2026-02-26 14:34:28 +01:00
Yaniv Michael Kaul
ead9961783 cql: vector: fix vector dimension type
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.

The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.

Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.

Add tests to verify the type boundaries.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>

Closes scylladb/scylladb#28762
2026-02-26 14:46:53 +02:00
Ferenc Szili
d0a5a1d5d0 load_stats: move tablet filtering for table size computation
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()

This is needed to better accomodate the next commit which will add
code for filtering tablets for tablet sizes.
2026-02-26 11:07:53 +01:00
Ferenc Szili
9cd2a04e79 load_stats: bring the comment and code in sync
Table sizes are collected in load stats, and are filtered according to
the migration stage, so as to avoid double accounting of tablet sizes.
The comment which explains the cut-off migration stage (cleanup) and
the cut-off stage in the code (streaming) are not the same.

This patch fixes that.
2026-02-26 11:03:33 +01:00
Michael Litvak
4a60ee28a2 test/cqlpy/test_materialized_view.py: increase view build timeout
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.

Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.

Fixes SCYLLADB-769

Closes scylladb/scylladb#28817
2026-02-26 11:27:51 +02:00
Michał Hudobski
579ed6f19f secondary_index_manager: fix double registration bug
We have observed a bug that caused Scylla to crash due to metrics double
registration. This bug is really difficult to reproduce and was seen
only once in the wild. We think that it may be caused by a request
in-flight keeping a reference to the stats object, making it not
deregister when the index is dropped, which casues a double registration
when we recreate the index, however we are not 100% sure.

This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug.

Fixes: #27252

Closes scylladb/scylladb#28655
2026-02-26 09:39:53 +01:00
Marcin Maliszkiewicz
30f18a91fd Merge 'dtest: wait_for speedup' from Dario Mirovic
Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second.

This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds.

Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.

This patch also improves performance of some alternator tests that use the same wait_for dtest function.

`wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster.

Refs SCYLLADB-573

This is a performance improvement of testing framework. No need to backport.

Closes scylladb/scylladb#28590

* github.com:scylladb/scylladb:
  dtest: shorten default sleep step in wait_for
  dtest: wait_for speedup
2026-02-26 09:33:38 +01:00
Amnon Heiman
5db971c2f9 estimated_histogram_test.cc: add to_metrics_histogram test
Add a test that exercises to_metrics_histogram when Min is smaller than
Precision.

The test verifies duplicate integer bounds are collapsed,
counts remain cumulative, and native histogram metadata is still present
with the expected schema and min id.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:52 +02:00
Amnon Heiman
0b4f28ae21 histogram_metrics_helper.hh: Support Min < Precision
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:38 +02:00
Amnon Heiman
af6371c11f estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
Add three test cases to verify the hybrid linear/exponential bucketing:
- test_histogram_min_1_bucket_limits: Validates bucket lower limits
- test_histogram_min_1_basic: Tests value insertion and bucket distribution
- test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean()
- test_histogram_min_2_precision_4: Test min == 2 and precision 4.

These tests cover the new Min<Precision mode with Precision=4, verifying both the
linear range and exponential range.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 08:53:06 +02:00
Amnon Heiman
6c21e5f80c estimated_histogram.hh: support Min less than Precision histograms
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O 1 time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.

The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.

This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.

Example with Min 2 and Precision 4
Buckets 2 2 3 3 4 5 6 7 8 10 12 14 and so on

Implementation details
Introduce SHIFT based on log2 Precision minus log2 Min when positive
Scale Min and Max by SHIFT for all exponential calculations
Compute NUM_BUCKETS using the standard log2 Max over Min formula
Use scaled value in find_bucket_index to avoid fractional bucket steps
Return bucket limits by scaling back to original units
Constraint relaxed from Min greater or equal to Precision to allow any Min
less than Max still power of two

This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 00:46:14 +02:00
Amnon Heiman
aca5284b13 test(alternator): add per-table latency coverage for item and batch ops
Add missing tests for per-table Alternator latency metrics to ensure recent
per-table latency accounting is actually validated.

Changes in this patch:

Refactor latency assertion helper into check_sets_latency_by_metric(),
parameterized by metric name.
Keep existing behavior by implementing check_sets_latency() as a wrapper
over scylla_alternator_op_latency.
Add test_item_latency_per_table() to verify
scylla_alternator_table_op_latency_count increases for:
PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem,
and BatchGetItem.
This closes a test gap where only global latency metrics were checked, while
per-table latency metrics were not covered.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Amnon Heiman
29e0b4e08c alternator: track per-table latency for batch get/write operations
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.

This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:

Latencies are stored per table used,

This aligns batch read/write metric behavior with other operations and improves
per-table observability.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Yaron Kaikov
b211590bc0 .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR

Closes scylladb/scylladb#28804
2026-02-25 16:39:17 +02:00
Nadav Har'El
1d265e7d6d test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".

So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:45 +02:00
Nadav Har'El
34c0c64d9d docs/cql: document the new CQL per-row TTL feature
Add user-facing documentation for the new CQL per-row TTL feature,
in docs/cql/cql-extensions.md.

Also mention (and link) the new alternative TTL feature in a few
relevant documents about the old (per-write) TTL,  about CDC,
and about the CREATE TABLE and ALTER TABLE commands.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
23ad0be034 test/cluster: tests for the new CQL per-row TTL feature
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:

1. Test that the TTL expiration work - both the scanning threads and
   the actual deletion work on all nodes - happens on the "streaming"
   scheduling group.

2. Test that even if one of the cluster's nodes is down, still all
   the items get expired - another node "takes over" the dead node's
   work.

3. Test that rolling upgrade works as designed for the CQL per-row TTL
   feature: Before every single node in the cluster is upgraded to
   support this feature, a TTL column cannot be enabled on a table.
   And as soon as the last node of the cluster is upgraded, the TTL
   feature begins to work completely (you don't need to reboot all
   the nodes again).

4. Test that expiration works correctly on a multi-DC setup. The test
   doesn't check the efficiency of this process - i.e., that today each
   DC scans part of the data, reading with LOCAL_QUORUM, and writing
   the deletions across the entire cluster. Rather, the test only
   verifies the correctness - that expired rows do get deleted -
   for the usual case the data across the DCs is consistent.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
7a1351c6cf test/cqlpy: tests for the new CQL per-row TTL feature
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.

These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.

These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.

The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.

All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.

Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:44 +02:00
Nadav Har'El
154cecda71 test: set low alternator_ttl_period_in_seconds in CQL tests
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.

Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
eebd7b0fbc cql ttl: fix ALTER TABLE to disable TTL if column is dropped
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.

If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.

A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
acbdf637b6 cql ttl: add setting/unsetting of TTL column to ALTER TABLE
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:

     ALTER TABLE tab TTL colname

and also the ability to disable per-row TTL with

     ALTER TABLE tab TTL NULL

as in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
the types timestamp, bigint or int.

You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.

A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:43 +02:00
Nadav Har'El
22c79b6af8 cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
This patch enables the per-row TTL feature in CQL (Refs #13000).

This patch allows the user to create a new table with one of its columns
designated as the TTL column with a syntax like:

    CREATE TABLE tab (
        id int PRIMARY KEY,
        t text,
        expiration timestamp TTL
    );

The column marked "TTL" must have the "timestamp", "bigint" or "int"
types (the choice of these types was explained in the previous patch),
and there can only be one such column. We decided not to allow a column
to be both a primary key column and a TTL column - although it would
have worked (it's supported in Alternator), I considered this non-useful
and confusing, and decided not to allow it in CQL. A TTL column also
can't be a static column.

We save the information of which column is the TTL column in a tag which
is read by the "expiration service" - originally a part of Alternator's
TTL implementation. After the previous patch, the expiration service is
running and knows how to understand CQL tables, so the CQL per-row TTL
feature will start to work.

This patch also implements DESC TABLE, printing the word "TTL" in the
right place of the output.

This patch doesn't yet implement ALTER TABLE that should allow enabling
or disabling the TTL column setting on an existing table - we'll do that
in the next patch.

A large collection of functional tests (in test/cqlpy), for every detail
of this feature will be added in a later patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
e636bc39ad ttl: add CQL support to Alternator's TTL expiration service
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day will come that we'll want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:

1. Before this patch, the expiration service was only started if
   Alternator was enabled. Now we need to start it unconditionally,
   as both Alternator and CQL will need to use it.
   The performance impact of the new background threads, when not
   needed, should be minimal: These threads will wake up every
   alternator_ttl_period_in_seconds (by default - once a day) and
   just check if any table has per-row TTL enabled, and if not, do
   nothing.

2. Before this patch, the expiration-time column had to be of type
   "decimal" - a variable-precision floating-point type. This made
   sense in Alternator - where all numbers are of this type, but CQL
   offers better and more efficient types for this purpose. In this
   patch we add support for two additional types for the expiration
   time column: The "timestamp" type (which uses millisecond precision,
   which our implementation truncates to whole seconds) and for the
   "bigint" type storing a number of seconds since the UNIX epoch.
   We also support the smaller "int" type for compatibility with
   existing data, but it is not recommended because a signed
   32-bit integer counting time from 1970 will break in 2038.

After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.

At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).

Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.

The metrics kept by the expiration service are the same metrics existing
for Alternator TTL, and fortunately do not have the name "alternator" in
their name:

   * scylla_expiration_scan_passes
   * scylla_expiration_scan_table
   * scylla_expiration_items_deleted
   * scylla_expiration_secondary_ranges_scanned

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
2823780557 alternator ttl: move TTL_TAG_KEY to a header file
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.

We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:42 +02:00
Nadav Har'El
5e16c59312 alternator ttl: remove unnecessary check of feature flag
Every node that supports the Alternator TTL feature should start its
background expiration-checking thread, *without* checking if other
nodes support this feature. This patch removes the unnecessary check.

Indeed, until all other nodes enable this feature, the background thread
will have nothing to do. but when finally all nodes have this feature -
we need this thread to already be on - without requiring another reboot
of all nodes to start this thread.

In practice, this change won't change anything on modern installations
because this feature is already three years old and always enabled on
modern clusters. But I don't want to repeat the same mistake for the new
CQL per-row TTL feature, so better fix it in Alternator too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
4f4e93b695 cql: add "cql_row_ttl" cluster feature
This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL
per-row TTL feature.

With this patch, this node reports supporting this feature, but the CQL
per-row TTL feature can only be used once all the nodes in the cluster
supports the feature. In other words, user requests to enable per-row TTL
on a table should check this feature flag (on the whole cluster) before
proceeding.

This is needed because the implementation of the per-row-TTL expiration
requires the cooperation of all nodes to participate in scanning for
expired items, so the feature can't be trusted until all nodes participate
in it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
0d6b7a6211 alternator: fix error message if UpdateTimeToLive is not supported
Since commit 2dedb5ea75, the Alternator
TTL feature is no longer experimental. It is still a "cluster feature"
meaning it cannot be used on a partially-upgraded cluster until the entire
cluster supports this feature.

The error message we printed when the cluster doesn't support this
feature was outdated, referring to the no-longer-existing experimental
feature. So this patch fixes the error message.

Since this feature is already three years old, nobody is likely to ever
see this error message (it can be seen only by someone upgrading an
even older cluster, during the rolling upgrade), but better not have
wrong error messages in the code, even if it's not seen by users.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:59:41 +02:00
Nadav Har'El
b78bb914d7 alternator: replace SCYLLA_ASSERT with throwing_assert
Replace all calls to SCYLLA_ASSSERT() in Alternator by the better and
safer throwing_assert() introduced in the previous patch.

As a result of this patch, if one of the call sites for these asserts
is buggy and ever fails, only the involved operation will be killed
by an exception, instead of crashing the whole server - and often the
entire cluster (as the same buggy request reaches all nodes and
crashes them all).

Additionally, this patch replaces a few existing uses in Alternator
of on_internal_error() with a non-interesting message with a
more-or-less equivalent, but shorter, throwing_assert(). The idea is
to convert the verbose idiom:

       if (!condition) {
          on_internal_error(logger, "some error message")
       }

With the shorter

       throwing_assert(condition)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Nadav Har'El
d876e7cd0a utils: introduce throwing_assert(), a safe replacement for assert
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").

throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc.  For example, you can do write just one line of throwing_assert():

    throwing_assert(p != nullptr);

Instead of much more verbose on_internal_error:

    if (p == nullptr) {
        utils::on_internal_error("assertion failed: p != nullptr")
    }

Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
like becomes clear in production:

* assert() is compiled-out in release builds. This means that the
  condition is not checked, and the code after the failed condition
  continues to run normally, potentially to disasterous consequences.

  In contrast, throwing_assert() continues to check the condition even in
  release builds, and if the condition is false it throws an exception.
  This ensures that the code following the condition doesn't run.

* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
  Scylla if the condition is not met.

  In contrast, throwing_assert() doesn't crash, but throws an exception.
  This means that the specific operation that encountered the error
  is aborted, instead of the entire server. It often also means that
  the user of this operation will see this error somehow and know
  which operation failed - instead of encountering a mysterious
  server (or even whole-cluster crash) without any indication which
  operation caused it.

Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr like assert and SCYLLA_ASSERT write, where users sometimes can't
see what is written.

Fixes #28308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Wojciech Mitros
d1ff8f1db3 docs: add strong consistency doc
Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
97dc88d6b6 test/cluster: add tests for strongly-consistent tables' metadata persistence
In this patch we add various tests for checking how strongly consistent
tables work while allowing their tablets to reside on non-0 shards and
while using the new persistent storage for their raft metadata.
The tests verify that:
- strongly consistent tables' tablets can be allocated on different shards
and we can write/read from them
- the raft metadata is persistent across restarts even with disruptions
- the sharder correctly routes metadata queries to specified shards
- we can correctly perform multi-shard reads from the metadata tables
- we can read using just the group_id (without shard) using ALLOW FILTERING

For the tests we add logging to the sharder and partitioner and we add
some extra logs for observability.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
f841c0522d raft: enable multi-shard raft groups for strongly consistent tablets
In this patch we allow strongly consistent tables to have tablets on
shards different than 0.
For that, we remove the checks for shard 0 for the non-group0 raft
groups, and  we allow the tablet allocator to place tablets of
strongly consistent tables on shards different than 0.
We also start using the new storage (raft::persistence) for strongly
consistent tables, added in the preceding commits.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
ffe32e8e4d test/raft: add unit tests for raft_groups_storage
Most functions of the new storage for raft groups for strongly
consistent tables are the same as for the system raft table
storage, so we reuse the tests for them to test the new storage.

We add additional tests for checking the new raft groups partitioner
and sharder, and for verifying that writes using storages for different
shards do not affect the data read on different shards.

We also add a test for checking the snapshot_descriptor present after
the storage bootstrap - for both system and strongly consistent storages
we check that the storage contains the initial descriptor.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
16977d7aa0 raft: add raft_groups_storage persistence class
Add raft_groups_storage, a raft::persistence implementation for
strongly consistent tablet groups.

Currently, it's almost an exact copy of the raft_sys_table_storage that
uses the new raft tables for strongly consistent tables (raft_groups,
raft_groups_snapshots, raft_groups_snapshot_config) which have
a (shard, group_id) partition key.

In the future, the mutation, term and commit_idx data will be stored
differently for for strongly consistent tables than for group0, which
will differentiate this class from the original raft_sys_table_storage.

The storage is created for each raft group server and it takes a shard
parameter at construction time to ensure all queries target the correct
partition (and thus shard).
2026-02-25 12:34:58 +01:00
Wojciech Mitros
654fe4b1ca db: add system tables for strongly consistent tables' raft groups
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:

- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
  and commit_idx for each tablet's raft group.

- system.raft_groups_snapshots: Stores snapshot descriptors
  (index, term) for each group.

- system.raft_groups_snapshot_config: Stores the raft configuration
  (current and previous voters) for each snapshot.

These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.

The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
cb0caea8bf dht: add fixed_shard_partitioner and fixed_shard_sharder
Add a custom partitioner and sharder that will be used for Raft tables
for strongly consistent tables. These tables will have partition keys
of the form (shard, group_id) and the partitioner creates tokens that
encode the target shard in the high 16 bits.

Token layout:
  [shard: 16 bits][partition key hash: 48 bits]

This encoding guarantees that raft group data will be located on the
same shard as the tablet replica corresponding to that raft group as long
we use the tablet replica's shard as the value in the partition key.

Storing the shard directly in the partition key avoids additional lookups
for request routing to the incoming new raft tables.
For even more simplicity, we avoid biasing between uint64_t and int64_t
by limiting the acceptable shard ids up to 32767 (leaving the top bit 0),
which results in the same value of the token when interpreting either as
uint64_t or int64_t.

The sharder decodes the shard by extracting the high bits, which is
shard-count independent. This allows the partition key:shard mapping
to remain the same even during smp changes (only increases are allowed,
the same limitation as for tablets).
2026-02-25 12:34:51 +01:00
Botond Dénes
99244179f7 Merge 'CQL transport: Add histogram-based request/response size tracking' from Amnon Heiman
This series closes a gap in how CQL request and response sizes are reported.

Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.

After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.

The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns

To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).

The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**

Fixes #14850

Closes scylladb/scylladb#28419

* github.com:scylladb/scylladb:
  test/boost/estimated_histogram_test.cc: Switch to real Sum
  transport/server: to bytes_histogram
  approx_exponential_histogram: Add sum() method for accurate value tracking
  utils/estimated_histogram.hh: Add bytes_histogram
2026-02-25 13:05:18 +02:00
Yaron Kaikov
98494e08eb ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs
refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv

- Remove -v and -i flags from curl to prevent credentials from being
  logged in workflow output
- Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting
  to prevent shell injection via crafted PR metadata
- Add org membership verification step for pull_request_target events so
  that only PRs from scylladb org members can trigger Jenkins CI

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796

Closes scylladb/scylladb#28785
2026-02-25 12:42:18 +02:00
Avi Kivity
511fab1f28 gossiper: exit failure detector sleep faster
When running unit tests, there's a visible ~1-second sleep
when gossip exits the failure detector loop.

Improve this by adding a condition variable for exiting the loop
and signaling it when any of the exit conditions are satisfied:
the abort_source is pulled, the gossiper is shut down, or the sleep
is complete. We can't just use the abort_source because gossip can be
shut down independently of the rest of the system.

To see the improvement, I ran cql_query_test in dev mode:

Before:

$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2  > /dev/null 2>&1

real	2m26.904s
user	0m24.307s
sys	0m13.402s

After:

$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2  > /dev/null 2>&1

real	0m26.579s
user	0m24.671s
sys	0m13.636s

Two minutes of real-time saved.

Real-life improvement in test.py will be lower, because of the overhead
of launching pytest for each test case.

Closes scylladb/scylladb#28649
2026-02-25 11:41:02 +02:00
Andrzej Jackowski
e2c4b0a733 config: add write_consistency_levels_* guardrails configuration
Add guardrails configuration that can be used later in this
patch series.

Refs: SCYLLADB-259
2026-02-25 10:30:03 +01:00
Avi Kivity
5baf16005f build: install antlr3 from maven + source, not rpm packages
Fedora removed the C++ backend from antlr3 [1], citing incompatible
license. The license in question (the Unicode license) is fine for us.

To be able to continue using antlr3, build it ourselves. The main
executable can be used as is from Maven, since we don't need any
patches for the parser. The runtime needs to be patched, so we download
the source and patch it.

Regenerated frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-773

Closes scylladb/scylladb#28765
2026-02-25 11:03:19 +02:00
Andrei Chekun
729bad77b1 test.py: add possibility to run downloaded Scylla binary
Add possibility to run Scylla binary that is stored or download the
relocatable package with Scylla.

Closes scylladb/scylladb#28787
2026-02-25 10:23:19 +02:00
Łukasz Paszkowski
9ade0b23da reader_concurrency_semaphore: set _ex in on_preemptive_abort()
When a permit is preemptively aborted, store the corresponding
exception in permit's member: `reader_permit::impl::_ex`.

This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.

Closes scylladb/scylladb#28591
2026-02-25 10:20:06 +02:00
Grzegorz Burzyński
b4f0eb666f packaging: add systemctl command to dependencies
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.

Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.

Fixes: scylladb/scylla-operator#3080

Closes scylladb/scylladb#28567
2026-02-25 10:19:32 +02:00
Botond Dénes
56cc7bbeec Merge 'Allow "global" snapshot using topology coordinator + add tablet metadata to manifest' from Calle Wilund
Refs: SCYLLADB-193

Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.

Logic is similar, and thus broken out and semi-shared with, truncation.

Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.

As per description in SCYLLADB-193, the alternative snapshot mechanism is in
a separate namespace under 'tablets', which while dubious is the desired destination.

The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op.

TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool).
Requires a syntax for both API and command line.

Closes scylladb/scylladb#28525

* github.com:scylladb/scylladb:
  topology::snapshot: Add expiry (ttl) to RPC/topo op
  test_snapshot_with_tablets: Extend test to check manifest content
  table::manifest: Add tablet info to manifest.json
  test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
  scylla-nodetool: Add "cluster snapshot" command
  api::storage_service: Add tablets/snapshots command for cluster level snapshot
  db::snapshot-ctl: Add method to do snapshot using topo coordinator
  storage_proxy: Add snapshot_keyspace method
  topology_coordinator: Add handler for snapshot_tables
  storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
  messaging_service: Add SNAPSHOT_WITH_TABLETS verb
  feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
  topology_mutation: Add setter for snapshot part of row
  system_keyspace::topology_requests_entry: Add snapshot info to table
  topology_state_machine: Add snapshot_tables operation
  topology_coordinator: Break out logic from handle_truncate_table
  storage_proxy: Break out logic from request_truncate_with_tablets
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-25 10:17:53 +02:00
Botond Dénes
166e245097 Merge 'test.py: Topology test pytest integration' from Andrei Chekun
Migrate cluster tests directory to be handled by pytest. This is the next step in process of unification of the tests and migration to the pytest.
With this PR cluster test will be executed with the full path to the file instead of `suite/test` paradigm.

Backport is not needed because it framework enhancement.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46

Closes scylladb/scylladb#27618

* github.com:scylladb/scylladb:
  test.py: remove setsid from the framework
  test.py: rename suite.yaml to test_config.yaml
  test.py: add cluster tests to be executed by pytest
  test.py: add random seed for topology tests reproducibility
  test.py: add explicit default values to pytest options
  test.py: replace SCYLLA env var with build_mode fixture
2026-02-25 10:17:20 +02:00
Botond Dénes
9dff9752b4 Merge 'Fix regression in Alternator TTL with tablets and node going down' from Nadav Har'El
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.

Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.

It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.

Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.

So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):

1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).

Fixes SCYLLADB-777.

Closes scylladb/scylladb#28771

* github.com:scylladb/scylladb:
  test: add unit test for tablet_map::get_secondary_replica()
  test, alternator: add test for TTL expiration with a node down
  locator: fix get_secondary_replica() to match get_primary_replica()
2026-02-25 10:13:55 +02:00
Gleb Natapov
0f8cdd81f3 group0: fix indentation after previous patch 2026-02-25 10:08:32 +02:00
Gleb Natapov
7d7cbae763 raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
No need for locking any more so the function may just return a value and
be synchronous.
2026-02-25 10:08:32 +02:00
Gleb Natapov
0689fb5ab2 raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream 2026-02-25 10:08:32 +02:00
Gleb Natapov
cd76604c79 raft_group0: remove unused code from raft_group0
Also do not pass raft_replace_info into setup_group0 since it is not
used there for a long time now.
2026-02-25 10:08:32 +02:00
Gleb Natapov
6173ea476b node_ops: remove topology over node ops code
The code is no longer called.
2026-02-25 10:08:32 +02:00
Gleb Natapov
758d1c9c39 topology: fix indentation after the previous patch 2026-02-25 10:08:31 +02:00
Gleb Natapov
67cd5755b2 topology: drop topology_change_enabled parameter from raft_group0 code
Since the parameter is always true there is no point to pass it
everywhere. Just assume it is true at the point of use.
2026-02-25 10:08:31 +02:00
Gleb Natapov
50da142e77 storage_service: remove unused handle_state_* functions
The handler are no longer called.
2026-02-25 10:08:31 +02:00
Gleb Natapov
1a57f2b22d gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
The function is unused now and the option that allows to skip the wait
is no longer needed as well.
2026-02-25 10:08:31 +02:00
Gleb Natapov
aa0f103eb9 storage_service: fix indentation after the last patch 2026-02-25 10:08:31 +02:00
Gleb Natapov
be6cced978 storage_service: remove gossiper bootstrapping code
Remove code that is responsible for bootstrapping a node in gossiper
mode since the mode is no longer supported.
2026-02-25 10:08:31 +02:00
Gleb Natapov
8776d00c44 storage_service: drop get_group_server_if_raft_topolgy_enabled
Raft topology is always enabled now so the function does not make sense
any longer.
2026-02-25 10:08:30 +02:00
Gleb Natapov
5fa4f5b08f storage_service: drop is_topology_coordinator_enabled and its uses
The code can now assume that topology coordinator is enabled.
2026-02-25 10:08:30 +02:00
Gleb Natapov
1e8d4436c7 storage_service: drop run_with_api_lock_in_gossiper_mode_only
Since gossiper mode is no longer exists the function is no longer
needed.
2026-02-25 10:08:30 +02:00
Gleb Natapov
a8a167623a topology: remove code that assumes raft_topology_change_enabled() may return false
The path removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since not raft mode is no longer supported.
2026-02-25 10:08:30 +02:00
Botond Dénes
8dbcd8a0b3 tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.

Add a test which fails before and passes after this patch.
2026-02-25 08:51:25 +02:00
Botond Dénes
cf39a5e610 tools/scylla-sstable: generalize dump_if_user_type
Rename to invoke_on_user_type() and make the action taken on user types
a function parameter. Enables reuse of the traverse logic by other code.
2026-02-25 08:51:25 +02:00
Botond Dénes
80049c88e9 tools/scylla-sstable: move dump_if_user_type() definition
So it can be used by create_table_in_cql_env() code.
2026-02-25 08:51:25 +02:00
Dario Mirovic
3222a1a559 dtest: shorten default sleep step in wait_for
Default sleep step of 1s is too long. Reduce it to make the test
environment more responsive and faster.

Refs SCYLLADB-573
2026-02-25 03:17:47 +01:00
Dario Mirovic
51e7c2f8d9 dtest: wait_for speedup
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.

This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.

Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.

This patch also improves performance of some alternator tests that
use the same wait_for dtest function.

Refs SCYLLADB-573
2026-02-25 03:17:46 +01:00
Andrei Chekun
1b92b140ee test.py: improve stdout output for boost test
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.

Closes scylladb/scylladb#28783
2026-02-25 00:50:25 +01:00
Ferenc Szili
f70ca9a406 load_stats: implement the optimized sum of tablet sizes
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.

Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.

In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.

Fixes: SCYLLADB-678

Closes scylladb/scylladb#28757
2026-02-24 22:19:25 +01:00
Marcin Maliszkiewicz
aa7816882e test: add test_uninitialized_conns_semaphore
Runtime in dev mode: 2s
2026-02-24 17:28:51 +01:00
Alex
5557770b59 test_mv_build_during_shutdown started two async CREATE MATERIALIZED VIEW operations and never awaited them (asyncio.gather(...) without await).
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.

Closes scylladb/scylladb#28774
2026-02-24 17:25:05 +01:00
Anna Stuchlik
64b1798513 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254
2026-02-24 14:37:39 +01:00
Anna Stuchlik
e2333a57ad doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781
2026-02-24 14:24:30 +02:00
Andrzej Jackowski
cd4caed3d3 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746
2026-02-24 12:08:44 +01:00
Botond Dénes
067bb5f888 test/scylla_gdb: skip coroutine tests if coroutine frame is not found
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.

It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.

Fixes: #22501

Closes scylladb/scylladb#28745
2026-02-24 10:12:03 +01:00
Marcin Maliszkiewicz
d5684b98c8 test: cluster: add continue-after-error to perf tool tests
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.

Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)

When CPU is starved on CI we can return timeouts and/or other errors.

The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759

Closes scylladb/scylladb#28767
2026-02-24 11:08:34 +02:00
Andrei Chekun
d9ce2db1a3 test.py: remove setsid from the framework
With previous architecture, scylla servers were handled by test.py and
if pytest fails, test.py was responsible for stopping scylla processes.
Now with only pytest handling, there is no such mechanism, that's why
I'm removing the setsid, so when the parent pytest process closes it
will automatically close all child including any started process during
testing. This will allow to not leave any scylla process in case pytest
was killed.
2026-02-24 09:48:38 +01:00
Andrei Chekun
d3f5f7468c test.py: rename suite.yaml to test_config.yaml
Switch of discovery of the tests by test.py
2026-02-24 09:48:38 +01:00
Andrei Chekun
e439ec9d67 test.py: add cluster tests to be executed by pytest
cluster tests are now executed by pytest also
Run pytest in an executor to avoid blocking the event loop, allowing
resource monitoring to run concurrently
Logic for passing the arguments to pytest changed due to the fact that
almost all directories now executed by pytest and directories that are
not handled excluded in pytest.ini
Modify the threads count for debug mode, because with the default
logic CI agents die
2026-02-24 09:48:38 +01:00
Andrei Chekun
edf7154fee test.py: add random seed for topology tests reproducibility
Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable
during pytest configuration to enable to ensure that all xdist workers will
discover the same scope of the tests. This is a known limitation of the
xdist plugi where the discovered tests should be consistenta across
master and workers.
2026-02-24 09:48:38 +01:00
Andrei Chekun
4a7d8cd99d test.py: add explicit default values to pytest options
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
2026-02-24 09:48:38 +01:00
Andrei Chekun
99234f0a83 test.py: replace SCYLLA env var with build_mode fixture
Replace direct usage of SCYLLA environment variable with the build_mode
pytest fixture and path_to helper function. This makes tests more
flexible and consistent with the test framework. Also this allows to use
tests with xdist, where environment variable can be left in the master
process and will not be set in the workers
Add using the fixture to get the scylla binary from the suite, this will
align with getting relocatable Scylla exe.
2026-02-24 09:48:38 +01:00
Avi Kivity
0add130b9d lua: avoid undefined behavior when converting doubles to integers
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.

The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.

Fix by tightening the conversion to not invoke undefined behavior.

Closes scylladb/scylladb#28503
2026-02-24 10:41:21 +02:00
Botond Dénes
1d5b8cc562 Merge 'Fix use after free in auth cache' from Marcin Maliszkiewicz
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet

Closes scylladb/scylladb#28766

* github.com:scylladb/scylladb:
  auth: cache: fix permissions iterator invalidation in reload_all_permissions
  auth/cache: acquire _loading_sem in cross-shard callbacks
2026-02-24 10:35:46 +02:00
Pavel Emelyanov
5a5eb67144 vector_search/dns: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28565
2026-02-23 21:28:43 +02:00
Pavel Emelyanov
6b02b50e3d Merge 'object_storage: add retryable machinery to object storage' from Ernest Zaslavsky
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate

No backport neede since it is only refactoring

Closes scylladb/scylladb#28161

* github.com:scylladb/scylladb:
  object_storage: add retryable machinery to object storage
  rest_client: add `simple_send` overload
2026-02-23 21:28:51 +03:00
Wojciech Mitros
c1b3fec11a raft: add group_id -> shard mapping to raft_group_registry
To handle RPC from other nodes, we need to be able to redirect the
requests for each raft group to the shard that owns it. We need to
be able to do the redirection on all shards, so to achieve that, on
all shards we need to store the information about which shard is
occupied by each Raft group server.
For that we add a group_id -> shard mapping to the raft_group_registry.
The mapping is filled out when starting raft servers, it's emptied
when we abort raft servers. We use it when registering RPC verb handlers,
so that regardless of the shard handling the RPC, the work on the raft
group can be performed on the corresponding shard.
2026-02-23 15:34:56 +01:00
Nadav Har'El
e463d528fe test: add unit test for tablet_map::get_secondary_replica()
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.

This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
0c7f499750 test, alternator: add test for TTL expiration with a node down
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:

 1. Even though Alternator TTL splits the work of scanning and expiring
    items between nodes, all the items get correctly expired.
 2. When one node is down, all the items still expire because the
    "secondary" owner of each token range takes over expiring the
   items in this range while the "primary" owner is down.

This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
9ab3d5b946 locator: fix get_secondary_replica() to match get_primary_replica()
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.

The next two patches will have two tests that fail before this patch,
and pass with it:

1. A unit test that checks that get_secondary_replica() returns a
   different node than get_primary_replica().

2. An Alternator TTL test that checks that when a node is down,
   expirations still happen because the secondary replica takes over
   the primary replica's work.

Fixes SCYLLADB-777

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:30 +02:00
Botond Dénes
dcd8de86ee Merge 'docs: update a documentation of adding/removing DC and rebuilding a node' from Aleksandra Martyniuk
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28521

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-23 16:15:16 +02:00
Andrei Chekun
6ae58c6fa6 test.py: move storage tests to cluster subdirectory
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests.This removes the standalone
test/storage/suite.yaml as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but to use unshare at first
iteration they were moved outside. Now they are using another way to
handle volumes without unshare, they should be in cluster

Closes scylladb/scylladb#28634
2026-02-23 16:14:15 +02:00
Gleb Natapov
e23af998e1 test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
They were running in recovery to reuse existing system tables without
group0 id, but since we want to remove recovery mode we need to
re-generate the tables.
2026-02-23 14:54:24 +02:00
Gleb Natapov
f589740a39 test: schema_change_test: drop schema tests relevant for no raft mode only
They were running in no longer supported recovery mode to force gossip
topology.
2026-02-23 14:54:24 +02:00
Gleb Natapov
92049c3205 topology: remove upgrade to raft topology code
We do no longer need this code since we expect that cluster to be
upgraded before moving to this version.
2026-02-23 14:54:24 +02:00
Gleb Natapov
4a9cf687cc group0: remove upgrade to group0 code
This patch removes ability of a cluster to upgrade from not having
group0 to having one. This ability is used in gossiper based recovery
procedure that is deprecated and removed in this version. Also remove
tests that uses the procedure.
2026-02-23 14:54:24 +02:00
Gleb Natapov
dcafb5c083 group0: refuse to boot if a cluster is still is not in a raft topology mode
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch refuses to boot if a
cluster is not in raft topology mode yet. It may happen if a node of a
cluster that is not yet in a raft topology is upgraded to a newer
version. If this happens the node has to be downgraded. Raft topology
has to be enabled on a cluster and then the node can be upgraded again.
2026-02-23 14:54:24 +02:00
Gleb Natapov
ed52d1a292 storage_service: refuse to join a cluster in legacy mode
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch disallows a node to
join a cluster that is still in legacy mode. A cluster needs to enable
raft mode first.
2026-02-23 14:54:24 +02:00
Marcin Maliszkiewicz
c5dc086baf Merge 'vector_search: return NaN for similarity_cosine with all-zero vectors' from Dawid Pawlik
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

Closes scylladb/scylladb#28609

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-23 13:10:44 +01:00
Aleksandra Martyniuk
9ccc95808f docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e4c42acd8f docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
1c764cf6ea docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e08ac60161 docs: describe upgrade to enforce_rack_list option 2026-02-23 12:44:57 +01:00
Aleksandra Martyniuk
eefe66b2b2 docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
2026-02-23 12:41:33 +01:00
Marcin Maliszkiewicz
54dca90e8c Merge 'test: move dtest/guardrails_test.py to test_guardrails.py' from Andrzej Jackowski
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file

No backport, `dtest/guardrails_test.py` is only on master

Closes scylladb/scylladb#28737

* github.com:scylladb/scylladb:
  test: move dtest/guardrails_test.py to test_guardrails.py
  test: prepare guardrails_test.py to be moved to test/cluster/
2026-02-23 12:34:43 +01:00
Marcin Maliszkiewicz
1293b94039 auth: cache: fix permissions iterator invalidation in reload_all_permissions
The inner loops in reload_all_permissions iterate role's permissions
and _anonymous_permissions maps across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.

We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).

Fixing by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
2026-02-23 12:14:22 +01:00
Calle Wilund
fec7df7cbb topology::snapshot: Add expiry (ttl) to RPC/topo op
Not set yet, but includes it in messages so it can be properly
set in calling code. Will add entry to manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
cc60d014ed test_snapshot_with_tablets: Extend test to check manifest content
Verifies we have the expected tablet info in manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
f7aa2aacfc table::manifest: Add tablet info to manifest.json
If using tablets, will add info for each tablet present in
snapshot, with repair time and token ranges + map each
sstable to its owning tablet
2026-02-23 11:37:17 +01:00
Calle Wilund
ae10b5a897 test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot 2026-02-23 11:37:16 +01:00
Calle Wilund
bac81df20f scylla-nodetool: Add "cluster snapshot" command
Similar to "normal" snapshot, but will use the cluster-wide,
topolgy coordinated snapshot API and path.
2026-02-23 11:37:16 +01:00
Calle Wilund
b0604d9840 api::storage_service: Add tablets/snapshots command for cluster level snapshot
Calls the topology_coordinator path for snapshots.
2026-02-23 11:37:16 +01:00
Piotr Dulikowski
a4c389413c Merge 'Hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks' from Alex Dathskovsky
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.

Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.

Patch 2 adds explicit callback lifetime coordination in view_builder:

  - introduce a seastar::gate member
  - acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
  - keep the hold alive until each detached future resolves
  - close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown

Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.

Testing:
  - pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)

backport: not required started happening in master

fixes: SCYLLADB-687

Closes scylladb/scylladb#28648

* github.com:scylladb/scylladb:
  db/view: gate detached view-builder callbacks during shutdown
  db:view: refactor on_update_view to use coroutine dispatcher
2026-02-23 11:28:37 +01:00
Calle Wilund
9680541144 db::snapshot-ctl: Add method to do snapshot using topo coordinator
Separated from "local" snapshot.
2026-02-23 11:27:15 +01:00
Calle Wilund
425d6b4441 storage_proxy: Add snapshot_keyspace method
Takes set of ks->tables tuples and issues snapshot for each.
If feature is enabled, keyspace is non-local, and uses tablets,
will issue topo coordinator call across cluster.

Keyspaces not fitting the above will just go to "normal" (node
local) snapshot.
2026-02-23 11:27:15 +01:00
Calle Wilund
012a065364 topology_coordinator: Add handler for snapshot_tables
Similar to truncate, translates the topo op into RPC call
to relevant replicas and waits.
2026-02-23 11:27:15 +01:00
Calle Wilund
2bc633c3bd storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS 2026-02-23 10:44:42 +01:00
Calle Wilund
d1b06482f0 messaging_service: Add SNAPSHOT_WITH_TABLETS verb 2026-02-23 10:44:42 +01:00
Calle Wilund
3075311f21 feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
To detect if cluster can do coordinated snapshot
2026-02-23 10:44:41 +01:00
Calle Wilund
988c5238cf topology_mutation: Add setter for snapshot part of row 2026-02-23 10:44:41 +01:00
Calle Wilund
8bb81f00f8 system_keyspace::topology_requests_entry: Add snapshot info to table
Adds required info to communicate snapshot requests via topology
coordinator.
2026-02-23 10:44:38 +01:00
Calle Wilund
642aa44937 topology_state_machine: Add snapshot_tables operation
Note: placed after "noop" op, to not confuse ops in a mixed
(upgrading) cluster
2026-02-23 10:43:28 +01:00
Calle Wilund
e3d4493bf6 topology_coordinator: Break out logic from handle_truncate_table
Makes handle_truncate_table use shared logic, because we would
like to reuse some of it for other, coming ops. I.e. snapshot.
2026-02-23 10:43:28 +01:00
Calle Wilund
6e39c3bb83 storage_proxy: Break out logic from request_truncate_with_tablets
Makes request_truncate_with_tablets use a parameterized helper,
because eventually we will want to use almost identical logic
for other ops, like snapshot.
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
ad0c2de0d1 test/object_store: Remove create_ks_and_cf() helper
Now all test cases use standard facilities to create data they test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
6711afd73b test/object_store: Replace create_ks_and_cf() usage with standard methods
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Pavel Emelyanov
ed3a326637 test/object_store: Shift indentation right for test cases
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 10:43:28 +01:00
Marcin Maliszkiewicz
75d4bc26d3 auth/cache: acquire _loading_sem in cross-shard callbacks
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.

Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
2026-02-23 10:30:03 +01:00
Pavel Emelyanov
3d07633300 test/object_store: Use itertools.product() for deeply nested loops
The test_restore_with_streaming_scopes want to run some loop body for
all (almost) combinations of scope, primary-replica-only and min tablet
count. For that three nested loops are used. Using itertools.product()
makes the code shorter, less indented and more explicit.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:28:53 +03:00
Pavel Emelyanov
a9a82f89ac test/object_store: Replace dataset creation usage with standard methods
Two places are fixed

1. The call to create_dataset() is replaced with three "library"
   methods. This makes it explicit which options and schema are used
   for that. Eventually, the large and bulky create_dataset will be
   removed

2. The part that restores data into a fresh new table calls some CQLs by
   hand, and partially re-uses variables obtained from previous call to
   create_dataset(). Using the same "library" methods to re-create an
   empty table makes this part much simpler

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:27:41 +03:00
Pavel Emelyanov
988606ac7f test/object_store: Shift indentation right for test_restore_with_streaming_scopes
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:27:09 +03:00
Pavel Emelyanov
5161aeee95 test/backup: Run keyspace flush and snapshot taking API in parallel
The take_snapshot() helper runs these API sequentially for every server.
Running them with asyncio.gather() slightly reduces the wait-time thus
improving the total runtime.

Before:
    CPU utilization: 2.1%
    real	0m33,871s
    user	0m22,500s
    sys	        0m13,207s

After:
    CPU utilization: 2.4%
    real	0m29,532s
    user	0m22,351s
    sys	        0m12,890s

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:36 +03:00
Pavel Emelyanov
21752a43fe test/backup: Re-use take_snapshot() helper in do_abort_restore()
The test in question does _exactly_ what this helper does, but in a
longer way. The only difference is that it uses server_id as key to dict
with sstable components, but it's easy to tune.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:35 +03:00
Pavel Emelyanov
818a99810c test/backup: Move take_snapshot() helper up
So that it's not in the middle of tests themselves, but near other
"helper" functions in the .py file

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-23 12:20:35 +03:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Marcin Maliszkiewicz
aba5a8c37f generic_server: fix waiters count in shed log
Capture semaphore waiters count when blocking starts,
not after the wait completes.
2026-02-20 17:04:23 +01:00
Marcin Maliszkiewicz
23bed55170 generic_server: scale connection concurrency semaphore by listener count
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.

Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
2026-02-20 16:59:19 +01:00
Patryk Jędrzejczak
e8efcae991 Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov
The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3

Cleaning tests, not backporting

Closes scylladb/scylladb#28600

* https://github.com/scylladb/scylladb:
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-20 15:53:38 +01:00
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Avi Kivity
92bc5568c5 tools: toolchain: build sanitizers for future toolchain
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.

Closes scylladb/scylladb#28733
2026-02-20 15:44:24 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).

There's shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Botond Dénes
7e90ed657c Merge 'Fix client_options docs' from Karol Baryła
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.

Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.

This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.

Backport: none, because this fixes internal documentation.

Closes scylladb/scylladb#28126

* github.com:scylladb/scylladb:
  protocol-extensions.md: Fix client_options docs
  system_keyspace.md: Add client_options column
  system_keyspace.md: Fix order in system.clients
2026-02-20 14:23:34 +02:00
Pavel Emelyanov
525cb5b3eb table: Use fmt::to_string() to stringify compation group ID
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28630
2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Andrzej Jackowski
eb5a564df2 test: move dtest/guardrails_test.py to test_guardrails.py
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
2026-02-20 11:39:52 +01:00
Andrzej Jackowski
9df426d2ae test: prepare guardrails_test.py to be moved to test/cluster/
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file
2026-02-20 11:39:43 +01:00
Raphael S. Carvalho
f33f324f77 mutation_compactor: Fix tombstone GC metrics to account for only expired
There are 3 metrics (that goes in every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable

When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
grace period), it can be currently accounted as failure due to
overlapping with either memtable or uncompacting sstable.
So those 2 last metrics have noise of *unexpired* tombstones.

What we should do is to only account for expired tombstones in all
those 3 metrics. We lose the info of knowing the amount of tombstones
processed by compaction, now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28669
2026-02-20 10:43:58 +02:00
Botond Dénes
0bf4c68af5 Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa
Links were pointing to the `debian` subdirectory. However, there docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910

No backport, just a README link fixes.

Closes scylladb/scylladb#28699

* github.com:scylladb/scylladb:
  docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
  docs: fix link to docker build README.MD
2026-02-20 08:21:46 +02:00
Botond Dénes
51a25c8af3 test/boost/batchlog_manager_test: add tests for v1 batchlog
The v1 table is used while upgrading from a pre-v2 version. We need
tests to ensure it still works.
2026-02-20 07:03:46 +02:00
Botond Dénes
83344dacbd test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also add dump_to_log::yes to is_rows() invokation to help debuging
tests.
2026-02-20 07:03:46 +02:00
Botond Dénes
2956714e19 test/boost/batchlog_manager_test: fix indentation 2026-02-20 07:03:46 +02:00
Botond Dénes
23732227fe test/boost/batchlog_manager_test: extract prepare_batches() method
To be shared between multiple tests in future commits.
Indentation is left broken.
2026-02-20 07:03:46 +02:00
Botond Dénes
af26956bb4 test/lib/cql_assertions: is_rows(): add dump parameter
When set to true, the query results will be logged by the testlog logger
with  debug level. A huge help when debugging failures around cql
assertions: seeing the actual query result is often enough to
immediately understand why the test failed.
2026-02-20 07:03:46 +02:00
Botond Dénes
48e9b3d668 tools/scylla-sstable: extract query result printers
To cql3/query_result_printer.hh. Allowing for other users, outside of
tools.
2026-02-20 07:03:46 +02:00
Botond Dénes
978627c4e1 tools/scylla-sstable: add std::ostream& arg to query result printers
Make them more general-purpose, in preparation to extracting them to
their own header.
2026-02-20 07:03:46 +02:00
Botond Dénes
0549b61d55 repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
Provides visibility into whether batchlog replay was successful or not.
2026-02-20 07:03:46 +02:00
Botond Dénes
dd50bd9bd4 db/batchlog_manager: re-add v1 support
system.batchlog will still have to be used while the cluster is
upgrading from an older version, which doesn't know v2 yet.
Re-add support for replaying v1 batchlogs. The switch to v2 will happen
after the BATCHLOG_V2 cluster feature is enabled.

The only external user -- storage_proxy -- only needs a minor
adjustment: switch between the table names. The rest is handled
transparently by the db/batchlog.hh interface and the batchlog_manager.
2026-02-20 07:03:46 +02:00
Botond Dénes
8ffa3d32c0 db/batchlog_manager: return all_replayed from process_batch()
process_batch() currently returns stop_iteration::no from all control
paths. This is not useful. Return the all_replayed output param instead.
This requires making the batch() lambda a coroutine, but considering the
amount of work process_batch() does (send multiple writes), this should
be inconsequential.
2026-02-20 07:03:46 +02:00
Botond Dénes
091b43f54b db/batchlog_manager: process_bath() fix indentation 2026-02-20 07:03:46 +02:00
Botond Dénes
ef2b8b4819 db/batchlog_manager: make batch() a standalone function
Currently it is a huge lambda. Deserves to be a standalone function, to
make the replay_all_failed_batches() easier to read and modify.
2026-02-20 07:03:46 +02:00
Botond Dénes
ca2bbbad97 db/batchlog_manager: make structs stats public
Need to rename stats() -> get_stats() because it shadows the now
exported type name.
2026-02-20 07:03:46 +02:00
Botond Dénes
f8bfaedb6e db/batchlog_manager: allocate limiter on the stack
Now that replay_all_failed_batches() is a coroutine, there is no need to
make it a shared pointer anymore.
2026-02-20 07:03:46 +02:00
Botond Dénes
ac059dadc6 db/batchlog_manager: add feature_service dependency
Will be needed to check for batchlog_v2 feature.
2026-02-20 07:03:46 +02:00
Botond Dénes
c901ab53d2 gms/feature_service: add batchlog_v2 feature 2026-02-20 07:03:45 +02:00
Avi Kivity
66bef0ed36 lua, tools: adjust for lua 5.5 lua_newstate seed parameter
Lua 5.5 adds a seed parameter to lua_newstate(), provide it
with a strong random seed.

Closes scylladb/scylladb#28734
2026-02-20 06:52:37 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alterantor remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Dawid Mędrek
c9d192c684 Merge 'raft ropology: prevent crashes of multiple nodes' from Patryk Jędrzejczak
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

Closes scylladb/scylladb#28558

* github.com:scylladb/scylladb:
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-02-19 16:50:03 +01:00
Marcin Maliszkiewicz
22c3d8d609 Merge 'db/config: enable table audit by default' from Piotr Smaron
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222

No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).

Closes scylladb/scylladb#28376

* github.com:scylladb/scylladb:
  docs: decommission: note audit ks may require ALTERing
  docs: mention table audit enabled by default
  audit: disable DDL by default
  db/config: enable table audit by default
  test/cluster: fix `test_table_desc_read_barrier` assertion
  test/cluster: adjust audit in tests involving decommissioning its ks
  audit_test: fix incorrect config in `test_audit_type_none`
2026-02-19 16:30:11 +01:00
Pavel Emelyanov
b4b9b547ce replica: Remove unused sched groups from keyspace and table configs
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28540
2026-02-19 15:47:31 +01:00
Patryk Jędrzejczak
45115415fb Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)

It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.

Improving tests, not backporting

Closes scylladb/scylladb#28569

* https://github.com/scylladb/scylladb:
  test: Merge test_restore_primary_replica_different_... tests
  test: Merge test_restore_primary_replica_same_... tests
  test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
  test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
2026-02-19 15:42:55 +01:00
Pavel Emelyanov
26372e65df Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky
Fix the build of the test and the upload operation flow

No need to backport since it is only a test we barely use

Closes scylladb/scylladb#28595

* github.com:scylladb/scylladb:
  s3_perf: fix upload operation flow
  s3_perf: fix the CMake build
2026-02-19 15:31:43 +02:00
Avi Kivity
7ec710c250 Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317

Closes scylladb/scylladb#28563

* github.com:scylladb/scylladb:
  tablets, config: Reduce migration concurrency to 2
  tablets: load_balancer: Always accept migration if the load is 0
  config, tablets: Make tablet migration concurrency configurable
2026-02-19 15:31:43 +02:00
Dawid Mędrek
fae71f79c2 test: raft: Add test_aborting_wait_for_state_change 2026-02-19 14:21:01 +01:00
Karol Nowacki
ca7f9a8baf vector_search: fix TLS server name with IP
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

Fixes: VECTOR-528
2026-02-19 13:00:03 +01:00
Karol Nowacki
6205aad601 vector_search: add warn log for failed ann requests
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
2026-02-19 13:00:03 +01:00
Dawid Mędrek
e4f2b62019 raft: Describe exception types for wait_for_state_change and wait_for_leader
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.

Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
2026-02-19 12:47:14 +01:00
Dawid Mędrek
c36623baad raft: Await instead of returning future in wait_for_state_change
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.

Fixes SCYLLADB-665
2026-02-19 12:47:14 +01:00
Marcin Maliszkiewicz
de4e5e10af test: perf: fix prepared statements logic in perf-simple-query
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.

We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).

Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1

Fixes https://github.com/scylladb/scylladb/issues/27941

Closes scylladb/scylladb#28704
2026-02-19 12:42:07 +02:00
Avi Kivity
58a662b9db dist: refresh container base image (ubi9-minimal)
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.

Fix by using --pull=always to get a fresh image.

Ref https://scylladb.atlassian.net/browse/SCYLLADB-714

Closes scylladb/scylladb#28680
2026-02-19 12:42:43 +03:00
Ferenc Szili
f1bc17bd4c load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703
2026-02-19 12:29:25 +03:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Ernest Zaslavsky
45d824e0fe s3_perf: fix upload operation flow
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
2026-02-19 11:14:59 +02:00
Botond Dénes
b637e17b19 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080
2026-02-19 09:51:09 +01:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Marcin Maliszkiewicz
3417d50add test: cluster: add tests for perf tools
It checks if all workloads can be properly
executed with succesfull startup and teardown.

Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.

Test runtime:
Release - 24s
Debug - 1m 15s

Test time consists mostly of Scylla startup in various modes.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
c69534504c test: perf: fix port race condition on startup in connect workload
Other workloads at startup call prepopulate() which connects
with retry loop therefore it waits until cql port is open.

This commit adds a single place where we will wait for port
for all workloads.

Timeout is set to 5 minutes so that even slowest machines
are able to start.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
828f2fbdb1 test: perf: prepare benchmarks to bind to custom host
This is usefull for tests where we use local networks
like 127.5.5.5 to avoid port and host collisions.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
9f2b97bef4 test: perf: make perf-alterantor remote port configurable
It could be a usefull option to have.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
f5a212e91e test: perf: fix ASAN leak warnings in perf-alternator
Those were intentional as test process is short lived
but when we add automated tests in the following commits
we expect clean exit, with 0 exit code.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
0c76c73e34 Reapply "main: test: add future and abort_source to after_init_func"
This reverts commit ceec703bb7.

The commit was fixed with abort source handling for alternator
standalone path so it's safe to reapply.
2026-02-19 09:33:10 +01:00
Piotr Dulikowski
7d6f734a51 dictionary compression: add missing co_awaits on get_units
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.

Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.

Closes scylladb/scylladb#28581
2026-02-18 16:40:40 +01:00
Ernest Zaslavsky
4026b54a5e s3_perf: fix the CMake build
Fix the CMake build of the perf_s3_client by adding the
necessary linkage with the jsoncpp library.
2026-02-18 17:12:08 +02:00
Piotr Smaron
797c5cd401 docs: decommission: note audit ks may require ALTERing
With audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
2026-02-18 15:14:57 +01:00
Piotr Smaron
65eec6d8e7 docs: mention table audit enabled by default
Also align the documentation with the current audit settings.
2026-02-18 15:14:57 +01:00
Piotr Smaron
c30607d80b audit: disable DDL by default
DDL audit category doesn't make sense if its enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, listed all possible audit categories for user's
convenience.
2026-02-18 15:14:57 +01:00
Piotr Smaron
08dc1008ba db/config: enable table audit by default
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
2026-02-18 15:14:57 +01:00
Piotr Smaron
95ee4a562c test/cluster: fix test_table_desc_read_barrier assertion
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
2026-02-18 15:14:57 +01:00
Piotr Smaron
2e12b83366 test/cluster: adjust audit in tests involving decommissioning its ks
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.

BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
2026-02-18 15:14:55 +01:00
Piotr Smaron
0cf20fa15a audit_test: fix incorrect config in test_audit_type_none
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
2026-02-18 15:12:26 +01:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Andrzej Jackowski
4221d9bbfd docs: improve examples in Handling Audit Failures section
This commit introduces four changes:
 - In the `table` example, singular forms (node, partition) are changed to
   plural forms (nodes, partitions). Currently, the default `table`
   audit configuration is RF=3 and writes use CL=ONE. Therefore,
   a `table` audit log write failure should not be caused by a single
   node unavailability, and plural forms are more adequate.
 - In the `table` example, unreachability due to network issues is
   mentioned because with RF=3, audit failure due to network problems
   is more likely to happen than a simultaneous failure of three
   nodes (such network failures happened in SCYLLADB-706).
 - In the `syslog` example, a slash `/` is changed to `or`, so `table`
   and `syslog` examples have similar structure.
 - As the `syslog` line is already being changed, I also change `unix`
   to `Unix`, as the capitalized form is the correct one.

Refs SCYLLADB-706

Closes scylladb/scylladb#28702
2026-02-18 13:10:01 +01:00
Botond Dénes
3bfd47da4b Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

Closes scylladb/scylladb#28530

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-18 13:48:49 +02:00
Marcin Szopa
9217f85e99 docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory 2026-02-18 12:19:27 +01:00
Marcin Szopa
e66bf4a6f5 docs: fix link to docker build README.MD
Link was pointing to the old place of the README. It was moved in the 1abf981a73
2026-02-18 12:12:46 +01:00
Piotr Dulikowski
b9db3c9c75 Merge 'Add consistent permissions cache' from Marcin Maliszkiewicz
This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.

Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh.

Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement.

Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147

Closes scylladb/scylladb#28078

* github.com:scylladb/scylladb:
  test: boost: add auth cache tests
  auth: add cache size metrics
  docs: conf: update permissions cache documentation
  auth: remove old permissions cache
  auth: use unified cache for permissions
  auth: ldap: add permissions reload to unified cache
  auth: add permissions cache to auth/cache
  auth: add service::revoke_all as main entry point
  auth: explicitly life-extend resource in auth_migration_listener
2026-02-18 12:03:20 +01:00
Tomasz Grabiec
af0b5d0894 Merge 'tablets global barrier: acknowledge barrier_and_drain from all nodes' from Petr Gusev
Before this series, the `global_barrier` used during tablet migration did not guarantee that `barrier_and_drain` was acknowledged by tablet replicas. As a result, if a request coordinator was fenced out, stale requests from previous topology versions could still execute on replicas in parallel with new requests from incompatible topology versions. For example, stale requests from `tablet_transition_stage::streaming` could run concurrently with new requests from `tablet_transition_stage::use_new`. This caused several issues, including [#26864](https://github.com/scylladb/scylladb/issues/26864) and [#26375](https://github.com/scylladb/scylladb/issues/26375).

This PR fixes the problem in two steps:
* Replicas now hold an erm strong pointer while handling RPCs from coordinators.
* The tablet barrier is updated to require `barrier_and_drain` acknowledgments from all nodes.

A description of alternative solutions and various tradeoffs can be found in [this document](https://docs.google.com/document/d/1tpDtPOsrGaZGBYkdwOKApQv4eMzrBydMM1GaYYmaPgg/edit?pli=1&tab=t.0#heading=h.vidfy0hrz5j7).

[A previous attempt on this changes](https://github.com/scylladb/scylladb/pull/27185).

Fixes [scylladb/scylladb#26864](https://github.com/scylladb/scylladb/issues/26864)
Fixes [scylladb/scylladb#26375](https://github.com/scylladb/scylladb/issues/26375)

backport: needs backport to 2025.4 (fixes #26864 for tablets LWT)

Closes scylladb/scylladb#27492

* github.com:scylladb/scylladb:
  tests: extract get_topology_version helper
  global tablets barrier: require all nodes to ack barrier_and_drain
  topology_coordinator: pass raft_topology_cmd by value
  storage_proxy: hold erms in replica handlers
  token_metadata: improve stale versions diagnostics
2026-02-18 11:45:56 +01:00
Pavel Emelyanov
0c443d5764 gms: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28566
2026-02-18 12:24:35 +02:00
Pavel Emelyanov
5b740afe9a database: Remove streaming sched group getter
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28527
2026-02-18 12:23:35 +02:00
Avi Kivity
c5a1f44731 tools: toolchain: switch from ccache to sccache
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.

The documentation is adjusted since cache will be
persistent for sccache without further work.

Closes scylladb/scylladb#28524
2026-02-18 12:23:12 +02:00
Botond Dénes
36167a155e Merge 'Remove map_to_key_value() helpers from API' from Pavel Emelyanov
There are some places that get `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.

Recently in seastar there appeared stream_range_as_array() helper that helps streaming any range without converting it into intermediate collection. Some of the hottest users of `map_to_key_value()` had been converted, this PR converts few remainders and removes the helper in question to encourage further usage of the stream_range_as_array().

Code cleanup, not backporting

Closes scylladb/scylladb#28491

* github.com:scylladb/scylladb:
  api: Remove map_to_key_value() helpers
  api: Streamify view_build_statuses handler
  api: Streamify few more storage_service/ handlers
  api: Add map_to_json() helper
  api: Coroutinize view_build_statuses handler
2026-02-18 12:22:00 +02:00
Ernest Zaslavsky
196f7cad93 nodetool: fix handling of "--primary-replica-only" argument
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.

Closes scylladb/scylladb#28490
2026-02-18 12:21:27 +02:00
Pavel Emelyanov
bce43c6b20 api: Remove unused (lost) local variable
Lost when the get_range_to_endpoint_map hander was implemented for real (48c3c94aa6)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28489
2026-02-18 12:20:30 +02:00
Ernest Zaslavsky
afac984632 s3_client: reorganize tests in part_size_calculation_test
just group all BOOST_REQUIRE_EXCEPTION tests in one block and
remove artificial scopes
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
1a20877afe s3_client: switch using s3 limits constants in tests
instead of using magic numbers, switch using s3 limit constants to
make it clearer what and why is tested
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove "aws" prefix from object limits constants since it is
irrelevant and unnecessary when sitting under s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make s3 limits constants publicly accessible to reuse it later
2026-02-18 12:12:04 +02:00
Alex
c44ad31d44 db/view: gate detached view-builder callbacks during shutdown
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
  can race with view_builder::drain() teardown.

  Add a lifetime gate to view_builder and wire callback launches through
  _ops_gate.hold() so each detached dispatch future is tracked until it
  completes (finally keeps the hold alive). During shutdown, drain()
  now waits for all tracked callback work with _ops_gate.close().

  This ensures drain does not proceed past callback lifetime while shutdown is in
  progress, and ignores only gate_closed_exception at callback entry as the
  expected shutdown path.
2026-02-18 11:56:41 +02:00
Pavel Emelyanov
b01adf643c Merge 'init: fix infinite loop on npos wrap with updated Seastar' from Emil Maskovsky
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.

The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.

Refs: https://github.com/scylladb/seastar/pull/3236

No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes).

Closes scylladb/scylladb#28573

* github.com:scylladb/scylladb:
  init: fix infinite loop on npos wrap with updated Seastar
  init: remove unnecessary object creation in emplace calls
2026-02-18 11:46:26 +03:00
Aleksandra Martyniuk
100ccd61f8 tasks: increase tasks_vt_get_children timeout
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.

Extend the message timeout of tasks_vt_get_children from 10s to 60s.

Fixes: #28295.

Closes scylladb/scylladb#28599
2026-02-18 11:39:19 +03:00
Dani Tweig
aac0f57836 .github/workflows: add SMI to milestone sync Jira project keys
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys

Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.

Fixes:PM-190

Closes scylladb/scylladb#28585
2026-02-18 09:35:37 +02:00
Nadav Har'El
a1475dbeb9 test/cqlpy: make test testMapWithLargePartition faster
Right now the slowest test in the test/cqlpy directory is

   cassandra_tests/validation/entities/collections_test.py::
      testMapWithLargePartition

This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.

200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!

Making the characters written in this text constant instead of random
made it 20 times fast - it now takes less than half a second.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28271
2026-02-18 10:12:16 +03:00
Raphael S. Carvalho
5b550e94a6 streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).

We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28505
2026-02-18 10:10:40 +03:00
Avi Kivity
f3cbd76d93 build: install cassandra-stress RPM with no signature check
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.

Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.

We install its dependencies via the normal methods so they're still
checked.

[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_default

Closes scylladb/scylladb#28687
2026-02-18 10:08:13 +03:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Tomasz Grabiec
d33d38139f test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715
2026-02-18 01:02:50 +01:00
Tomasz Grabiec
2454de4f8f test: cluster: task_manager_client: Introduce wait_task_appears() 2026-02-18 01:02:44 +01:00
Tomasz Grabiec
e14eca46af tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.
2026-02-18 01:02:19 +01:00
Szymon Malewski
668d6fe019 vector: Improve similarity functions performance
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471

Closes scylladb/scylladb#28615
2026-02-18 00:33:34 +02:00
Calle Wilund
ab4e4a8ac7 commitlog: Always abort replenish queue on loop exit
Fixes #28678

If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679
2026-02-17 23:46:47 +02:00
Emil Maskovsky
6b98f44485 init: fix infinite loop on npos wrap with updated Seastar
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow in `uint64_t`
target and exit the loop as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.

Refs: scylladb/seastar#3236
2026-02-17 17:57:13 +00:00
Emil Maskovsky
bda0fc9d93 init: remove unnecessary object creation in emplace calls
Simplifies code by directly passing constructor arguments to emplace,
avoiding redundant temporary gms::inet_address() object creation.
Improves clarity and potentially performance in affected areas.
2026-02-17 17:57:12 +00:00
Marcin Maliszkiewicz
741969cf4c test: boost: add auth cache tests
The cache is covered already with general auth
dtests but some cases are more tricky and easier
to express directly as calls to cache class.
For such tests boost test file was added.
2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
c11eb73a59 auth: add cache size metrics 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a059798de9 docs: conf: update permissions cache documentation 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a23e503e7b auth: remove old permissions cache 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
9d9184e5b7 auth: use unified cache for permissions 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
7eedf50c12 auth: ldap: add permissions reload to unified cache
The LDAP server may change role-chain assignments without notifying
Scylla. As a result, effective permissions can change, so some form of
polling is required.

Currently, this is handled via cache expiration. However, the unified
cache is designed to be consistent and does not support expiration.
To provide an equivalent mechanism for LDAP, we will periodically
reload the permissions portion of the new cache at intervals matching
the previously configured expiration time.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
10996bd0fb auth: add permissions cache to auth/cache
We want to get rid of loading cache because its periodic
refresh logic generates a lot of internal load when there
is many entries. Also our operation procedures involve tweaking
the config while new unified cache is supposed to work out
of the box.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
03c4e4bb10 auth: add service::revoke_all as main entry point
In the following commit we'll need to add some
cache related logic (removing resource permissions).
This logic doesn't depend on authorizer so it should
be managed by the service itself.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
070d0bfc4c auth: explicitly life-extend resource in auth_migration_listener
Otherwise it's easy to trigger use-after-free when code slightly changes.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
3b98451776 test: auth_cluster: add test for hanged AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s
2026-02-17 17:55:48 +01:00
Marcin Maliszkiewicz
0376d16ad3 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
2026-02-17 17:55:48 +01:00
Dani Tweig
5dc06647e9 .github: add workflow to auto-close issues from ScyllaDB associates
Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates

We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associates opens by mistake an issue in scylladb.git repo, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira.

This is a new github action, and does not require any code backport.

Fixes: PM-64

Closes scylladb/scylladb#28212
2026-02-17 17:18:32 +02:00
Dani Tweig
bb8a2c3a26 .github/workflow/:Add milestone sync to Jira based on GitHub Action
What changed
Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml

Why (Requirements Summary)
Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR
When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue
When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue
Testing was performed in staging.git and the STAG Jira project.

Fixes:PM-177

Closes scylladb/scylladb#28575
2026-02-17 16:41:03 +02:00
Botond Dénes
2e087882fa Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
 Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

Closes scylladb/scylladb#28400

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-17 16:40:02 +02:00
Andrei Chekun
1b5789cd63 test.py: refactor manager fixture
The current manager flow have a flaw. It will trigger pytest.fail when
it found errors on teardown regardless if the test was already failed.
This will create an additional record in JUnit report with the same name
and Jenkins will not be able to show the logs correctly. So to avoid
this, this PR changes logic slightly.
Now manager will check that test failed or not to avoid two fails for
the same test in the report.
If test passed, manager will check the cluster status and fail if
something wrong with a status of it. There is no need to check the
cluster status in case of test fail.
If test passed, and cluster status if OK, but there are unexpected
errors in the logs, test will fail as well. But this check will gather
all information about the errors and potential stacktraces and will only
fail the test if it's not yet failed to avoid double entry in report.

Closes scylladb/scylladb#28633
2026-02-17 14:35:18 +01:00
Dawid Mędrek
5b5222d72f Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.

No backport, test update.

Closes scylladb/scylladb#28571

* github.com:scylladb/scylladb:
  test: run test_different_group0_ids in all modes
  test: make test_different_group0_ids work with the Raft-based topology
2026-02-17 13:56:41 +01:00
Dawid Mędrek
1b80f6982b Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili
Currently, the load balancing simulator computes node, shard and tablet load based on tablet count.

This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count.

This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series:

- First part for tablet size collection via load_stats: scylladb/scylladb#26035
- Second part reconcile load_stats: scylladb/scylladb#26152
- The third part for load_sketch changes: scylladb/scylladb#26153
- The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254
- The fifth part changes the load balancing simulator: scylladb/scylladb#26438

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26438

* github.com:scylladb/scylladb:
  test, simulator: compute load based on tablet size instead of count
  test, simulator: generate tablet sizes and update load_stats
  test, simulator: postpone creation of load_stats_ptr
2026-02-17 13:29:37 +01:00
Avi Kivity
ffde2414e8 cql3: grammar: remove special case for vector similarity functions in selectors
In b03d520aff ("cql3: introduce similarity functions syntax") we
added vector similarity functions to the grammar. The grammar had to
be modified because we wanted to support literals as vector similarity
function arguments, and the general function syntax in selectors
did not allow that.

In cc03f5c89d ("cql3: support literals and bind variables in
selectors") we extended the selector function call grammar to allow
literals as function arguments.

Here, we remove the special case for vector similarity functions as
the general case in function calls covers all the possibilities the
special case does.

As a side effect, the vector similarity function names are no longer
reserved.

Note: the grammar change fixes an inconsistency with how the vector
similarity functions were evaluated: typically, when a USE statement
is in effect, an unqualified function is first matched against functions
in the keyspace, and only if there is no match is the system keyspace
checked. But with the previous implementation vector similarity functions
ignored the USE keyspace and always matched only the system keyspace.

This small inconsistency doesn't matter in practice because user defined
functions are still experimental, and no one would name a UDF to conflict
with a system function, but it is still good to fix it.

Closes scylladb/scylladb#28481
2026-02-17 12:40:21 +01:00
Ernest Zaslavsky
30699ed84b api: report restore params
report restore params once the API's call for restore is invoked

Closes scylladb/scylladb#28431
2026-02-17 14:27:21 +03:00
Andrei Chekun
767789304e test.py: improve C++ fail summary in pytest
Currently, if the test fail, pytest will output only some basic information
about the fail. With this change, it will output the last 300 lines of the
boost/seastar test output.
Also add capturing the output of the failed tests to JUnit report, so it
will be present in the report on Jenkins.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449

Closes scylladb/scylladb#28535
2026-02-17 14:25:28 +03:00
Pavel Emelyanov
6d4af84846 Merge 'test: increase open file limit for sstable tests' from Avi Kivity
In ebda2fd4db ("test: cql_test_env: increase file descriptor limit"),
we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env
as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests
open 256 sstables, and with 2 files/sstable + resharding work it is understandable
that they overflow the 1024 limit.

No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches.

Closes scylladb/scylladb#28646

* github.com:scylladb/scylladb:
  test: sstables::test_env: adjust file open limit
  test: extract cql_test_env's adjust_rlimit() for reuse
2026-02-17 14:19:43 +03:00
Avi Kivity
41925083dc test: minio: tune sync setting
Disable O_DSYNC in minio to avoid unnecessary slowdown in S3
tests.

Closes scylladb/scylladb#28579
2026-02-17 14:19:27 +03:00
Avi Kivity
f03491b589 Update seastar submodule
* seastar f55dc7eb...d2953d2a (13):
  > io_tester: Revive IO bandwidth configuration
  > Merge 'io_tester: add vectorized I/O support' from Travis Downs
    doc: add vectorized I/O options to io-tester.md
    io_tester: add vectorized I/O support
  > Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov
    reactor: Drop sched group IDs bitmap
    reactor: Allocate scheduling group on shard-0 first
    reactor: Detach init_scheduling_group_specific_data()
    reactor: Coroutinize create_scheduling_group()
  > set_iterator: increase compatibility with C++ ranges
  > test: fix race condition in test_connection_statistics
  > Add Claude Code project instructions
  > reactor: Unfriend pollable_fd via pollable_fd_state::make()
  > Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń
    apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job.
    apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server.
  > treewide: remove remnants of SEASTAR_MODULE
  > test: Tune abort-accept test to use more readable async()
  > build: support sccache as a compiler cache (#3205)
  > posix-stack: Reuse parent class _reuseport from child
  > Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg
    tests: Add unit test for epoll busy spin bug
    reactor_backend: Fix another busy spin bug in epoll

Closes scylladb/scylladb#28513
2026-02-17 13:13:22 +02:00
Jakub Smolar
189b056605 scylla_gdb: use run_ctx to nahdle Scylla exe and remove pexpect
Previous implementation of Scylla lifecycle brought flakiness to the test.
This change leaves lifecycle management up to PythonTest.run_ctx,
which implements more stability logic for setup/teardown.

Replace pexpect-driven GDB interaction with GDB batch mode:
- Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty()
may lead to deadlocks in the child.", which ultimately caused CI deadlocks.
- Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts.
- Produces cleaner, more direct assertions around command execution and output.
- Trade-off: batch mode adds ~10s per command per test,
but with --dist=worksteal this is ~10% overall runtime increase across the suite.

Closes scylladb/scylladb#28484
2026-02-17 11:36:20 +01:00
Łukasz Paszkowski
f45465b9f6 test_out_of_space_prevention.py: Lower the critical disk utilization threshold
After PR https://github.com/scylladb/scylladb/pull/28396 reduced
the test volumes to 20MiB to speed up test_out_of_space_prevention.py,
keeping the original 0.8 critical disk utilization threshold can make
the tests flaky: transient disk usage (e.g. commitlog segment churn)
can push the node into ENOSPC during the run.

These tests do not write much data, so reduce the critical disk
utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB
of headroom for temporary growth during the test.

Fixes: https://github.com/scylladb/scylladb/issues/28463

Closes scylladb/scylladb#28593
2026-02-16 15:10:18 +02:00
Andrei Chekun
e26cf0b2d6 test/cluster: fix two flaky tests
test_maintenance_socket with new way of running is flaky. Looks like the
driver tries to reconnect with an old maintenance socket from previous
driver and fails. This PR adds white list for connection that stabilize
the test
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption that under load we have race
condition and trying to check the logs before message is arrived. Small
for loop to retry added to avoid such situation.

Closes scylladb/scylladb#28635
2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak
0693091aff test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632
2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz
6a4aef28ae Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

Closes scylladb/scylladb#28625

* github.com:scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-16 11:38:24 +01:00
Ernest Zaslavsky
034c6fbd87 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.

Closes scylladb/scylladb#28554
2026-02-16 13:32:58 +03:00
Botond Dénes
9f57d6285b Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz
Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail,
increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures.

Fixes https://github.com/scylladb/scylladb/issues/27745
Backport: no, it's not a bug fix

Closes scylladb/scylladb#28628

* github.com:scylladb/scylladb:
  test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
  test: pylib: increase curl's number of retries when downloading scylla
  test: pylib: improve error reporting in get_scylla_2025_1_executable
2026-02-16 10:09:17 +02:00
Petr Gusev
c785d242a7 tests: extract get_topology_version helper
This is a refactoring commit. We need to load the cluster version
for a host in several places, so extract a helper for this.
2026-02-16 08:57:42 +01:00
Petr Gusev
ffe3262e8d global tablets barrier: require all nodes to ack barrier_and_drain
Previously, global_tablet_token_metadata_barrier() could proceed with
fencing even if some nodes did not acknowledge the barrier_and_drain.
This could cause problems:
* In scylladb/scylladb#26864, replica locks did not provide mutual
exclusion, because “fenced out” requests from old topology versions
could run in parallel with requests using newer versions.
* In scylladb/scylladb#26375, the barrier could succeed even though we
did not wait for closed sessions to become unused. This could leave
aborted repair or streaming tasks running concurrently after a tablet
transition was aborted, and thus running concurrently with the next
transition.

In this commit we add a parameter drain_all_nodes: bool to
the global_token_metadata_barrier function. If this parameter is set,
the barrier waits for all nodes to acknowledge the barrier_and_drain
round of RPCs. If any of the nodes are not accessible or throw an error,
such errors are rethrown to the caller. We set this parameter only in
global_tablet_token_metadata_barrier since for topology migrations
the old behavior should be preserved. In case of errors, the tablet
migration is blocked until the problem goes away by itself or the
problematic node is added to the ignore_nodes list.

The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is
removed: with tablets, we now drain all nodes, so after a successful
barrier_and_drain round there can be no coordinators with an old
topology version. The fence_token check after executing a request on
a replica is therefore unnecessary for tablets, but still required for
vnodes, where topology changes do not wait for all nodes.
Topology fencing is covered by test_fence_lwt_during_bootstrap.

Fixes scylladb/scylladb#26864
Fixes scylladb/scylladb#26375
2026-02-16 08:57:42 +01:00
Petr Gusev
06f88b43e5 topology_coordinator: pass raft_topology_cmd by value
It's just a single enum. Passing by reference risks
use-after-free if a temporary command is created on
the stack in a non-coroutine function.
2026-02-16 08:57:42 +01:00
Petr Gusev
df73f723a6 storage_proxy: hold erms in replica handlers
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.

Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.

For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.

The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.

Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.

Refs scylladb/scylladb#26864
2026-02-16 08:57:42 +01:00
Petr Gusev
e39f4b399c token_metadata: improve stale versions diagnostics
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.

This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
2026-02-16 08:57:42 +01:00
Andrei Chekun
8c5c1096c2 test: ensure that that table used it cqlpy/test_tools have at least 3 pk
One of the tests check that amount of the PK should be more than 2, but
the method that creates it can return table with less keys. This leads
to flakiness and to avoid it, this PR ensures that table will have at
least 3 PK

Closes scylladb/scylladb#28636
2026-02-16 09:50:58 +02:00
Anna Mikhlin
33cf97d688 .github/workflows: ignore quoted comments for trigger CI
prevent CI from being triggered when trigger-ci command appears inside
quoted (>) comment text

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604
2026-02-16 09:33:16 +02:00
Andrei Chekun
e144d5b0bb test.py: fix JUnit double test case records
Move the hook for overwriting the XML reporter to be the first, to
avoid double records.

Closes scylladb/scylladb#28627
2026-02-15 19:02:24 +02:00
Alex
75e25493c1 db:view: refactor on_update_view to use coroutine dispatcher
on_update_view() currently runs its serialized logic inline via with_semaphore()
  from a detached callback path, while create/drop already use dedicated async
  dispatchers.

  Refactor update handling to follow the same pattern:

  - add dispatch_update_view(sstring ks_name, sstring view_name)
  - move update logic into that coroutine
  - acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
  - keep existing behavior for missing base/view state
  - keep background invocation semantics from on_update_view()

  This aligns update/create/drop flow and keeps async lifecycle handling and a first step to fix shutdown issue.
2026-02-15 18:50:32 +02:00
Avi Kivity
a365e2deaa test: sstables::test_env: adjust file open limit
The twcs compaction tests open more than 1024 files (not
so good), and will fail in a user session with the default
soft limit (1024).

Attempt to raise the limit so the tests pass. On a modern
systemd installation the hard limit is >500,000, so this
will work.

There's no problem in dbuild since it raises the file limit
globally.
2026-02-15 14:27:37 +02:00
Avi Kivity
bab3afab88 test: extract cql_test_env's adjust_rlimit() for reuse
The sstable-oriented sstable::test_env would also like to use
it, so extract it into a neutral place.
2026-02-15 14:26:46 +02:00
Jenkins Promoter
69249671a7 Update pgo profiles - aarch64 2026-02-15 05:22:17 +02:00
Jenkins Promoter
27aaafb8aa Update pgo profiles - x86_64 2026-02-15 04:26:36 +02:00
Piotr Dulikowski
9c1e310b0d Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

Closes scylladb/scylladb#28617

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-13 19:03:50 +01:00
Taras Veretilnyk
f140ab0332 sstables: extract default write open flags into a constant
Extract the commonly used `open_flags::wo | open_flags::create |
open_flags::exclusive` into a reusable constant
`sstable_write_open_flags` to reduce duplication.
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
c8281b7b8b sstables: Add write_simple_with_digest for component checksumming
Introduce new methods to write SSTable components while calculating
and returning their CRC32 checksums. This adds:

- make_digests_component_file_writer(): creates a crc32_digest_file_writer
  for component writing with checksum tracking
- write_simple_with_digest() and do_write_simple_with_digest(): write
  components and return the full checksum value
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
1bf934c77c sstables: Extract file writer closing logic into separate methods
Refactor the consume_end_of_stream() method by extracting the inline
file writer closing logic into dedicated methods:
- close_index_writer()
- close_partitions_writer()
- close_rows_writer()
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
dec5e48666 sstables: Implement CRC32 digest-only writer
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
2026-02-13 14:27:00 +01:00
Patryk Jędrzejczak
aebc108b1b test: run test_different_group0_ids in all modes
CI currently fails in release and debug modes if the PR only changes
a test run only in dev mode. There is no reason to wait for the CI fix,
as there is no reason to run this test only in dev mode in the first
place. The test is very fast.
2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak
59746ea035 test: make test_different_group0_ids work with the Raft-based topology
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz
1b0a68d1de test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
It's difficult to say if our download backend would always return
transient error correctly so that the curl could retry. Instead it's
more robust to always retry on error.
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
8ca834d4a4 test: pylib: increase curl's number of retries when downloading scylla
By default curl does exponential backoff, and we want to keep that
but there is time cap of 10 minutes, so with 40 retries we'd wait
long time, instead we set the cap to 60 seconds.

Total waiting time (excluding receiving request time):
before - 17m
after - 35m
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
70366168aa test: pylib: improve error reporting in get_scylla_2025_1_executable
Curl or other tools this function calls will now log error
in the place they fail instead of doing plain assert.
2026-02-12 16:18:52 +01:00
Andrzej Jackowski
9ffa62a986 test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
2026-02-12 14:58:39 +01:00
Andrzej Jackowski
e63cfc38b3 test: remove unneeded semicolons from python test 2026-02-12 14:49:17 +01:00
Patryk Jędrzejczak
8e9c7397c5 raft topology: prevent accessing nullptr returned by topology::find
It's better to raise an internal error than cause a segmentation fault on
possibly multiple nodes.
2026-02-12 13:10:04 +01:00
Patryk Jędrzejczak
e21ecf69de raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
2026-02-12 13:10:03 +01:00
Ferenc Szili
d7cfaf3f84 test, simulator: compute load based on tablet size instead of count
This patch changes the load balancing simulator so that it computes
table load based on tablet sizes instead of tablet count.

best_shard_overcommit measured minimal allowed overcommit in cases
where the number of tablets can not be evenly distributed across
all the available shards. This is still the case, but instead of
computing it as an integer div_ceil() of the average shard load,
it is now computed by allocating the tablet sizes using the
largest-tablet-first method. From these, we can get the lowest
overcommit for the given set of nodes, shards and tablet sizes.
2026-02-12 12:54:55 +01:00
Ferenc Szili
216443c050 test, simulator: generate tablet sizes and update load_stats
This change adds a random tablet size generator. The tablet sizes are
created in load_stats.

Further changes to the load balance simulator:

- apply_plan() updates the load_stats after a migration plan is issued by the
load balancer,

- adds the option to set a command line option which controls the tablet size
deviation factor.
2026-02-12 12:54:55 +01:00
Ferenc Szili
e31870a02d test, simulator: postpone creation of load_stats_ptr
With size based load balancing, we will have to move the tablet size in
load_stats after each internode migration issued by balance_tablets().
This will be done in a subsequent commit in apply_plan() which is
called from rebalance_tablets().

Currently, rebalance_tablets() is passed a load_stats_ptr which is
defined as:

using load_stats_ptr = lw_shared_ptr<const load_stats>;

Because this is a pointer to const, apply_plan() can't modify it.

So, we pass a reference to load_stats to rebalance_tablets() and create
a load_stats_ptr from it for each call to balance_tablets().
2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk
f955a90309 test: fix test_remove_node_violating_rf_rack_with_rack_list
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.

Add the 5th node to the cluster. This way the majority is always up.

Fixes: https://github.com/scylladb/scylladb/issues/28596.

Closes scylladb/scylladb#28610
2026-02-12 12:58:48 +02:00
Ferenc Szili
4ca40929ef test: add read barrier to test_balance_empty_tablets
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.

After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.

Fixes: SCYLLADB-603

Closes scylladb/scylladb#28598
2026-02-12 11:16:34 +02:00
Karol Nowacki
079fe17e8b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
2026-02-12 10:08:37 +01:00
Karol Nowacki
aef5ff7491 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
2026-02-12 09:58:54 +01:00
Piotr Dulikowski
38c4a14a5b Merge 'test: cluster: Fix test_sync_point' from Dawid Mędrek
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

Closes scylladb/scylladb#28602

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-12 09:34:09 +01:00
Dawid Pawlik
4e32502bb3 test/vector_search: add reproducer for rescoring with zero vectors
Add reproducer for the SCYLLADB-456 issue following exception
on ANN vector queries with rescoring with similarity cosine.
2026-02-11 13:41:09 +01:00
Dawid Pawlik
af0889d194 vector_search: return NaN for similarity_cosine with all-zero vectors
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456
2026-02-11 12:31:47 +01:00
Pavel Emelyanov
2a3a56850c test: Fix the condition for streaming directions validation
Commit ea8a661119 tried to reduce the dataset for restoration tests.
While doing it effectively disabled part of itself -- the checks for
streaming directions were never ran after this change. The thing is that
this check only runs if restored tablet count matches some hardcoded one
of 512. This was the real dataset size of the test before the
aforementioned commit, but after it it had changed to over values, and
the comparison with 512 became always False.

Fix it with a local variable to prevent such mistakes in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:55:27 +03:00
Pavel Emelyanov
f187dceb1a test: Split test_backup.py::check_data_is_back() into two
This method does two things -- checks that the data is indeed back, and
validates streaming directions. The latter is not quite about "data is
back", so better to have it as explicit dedicated method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:54:20 +03:00
Dawid Mędrek
f83f911bae test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
a256ba7de0 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203
2026-02-10 17:05:02 +01:00
Dawid Mędrek
c5239edf2a test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
ac4af5f461 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
2026-02-10 17:05:01 +01:00
Dawid Mędrek
628e74f157 test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
2026-02-10 17:04:59 +01:00
Pavel Emelyanov
875fd03882 test/object_store: Remove create_ks_and_cf() helper
Now all test cases use standard facilities to create data they test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:59:05 +03:00
Pavel Emelyanov
94176f7477 test/object_store: Replace create_ks_and_cf() usage with standard methods
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:58:05 +03:00
Pavel Emelyanov
6665cda23f test/object_store: Shift indentation right for test cases
This is preparational patch. Next will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:56:27 +03:00
Ernest Zaslavsky
960adbb439 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html
2026-02-10 13:15:07 +02:00
Ernest Zaslavsky
6280cb91ca s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
2026-02-10 13:13:26 +02:00
Ernest Zaslavsky
289e910cec s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
2026-02-10 13:13:25 +02:00
Wojciech Mitros
c5a44b0f88 schema: add with_sharder overload accepting static_sharder reference
Add a schema_builder::with_sharder() overload that accepts a const
reference to dht::static_sharder. This allows schemas to use custom
sharder instances instead of only static sharder configurations.

This is needed to support tables that use custom partitioning and
sharding strategies, such as the incoming raft metadata tables for
strongly consistent tables.
2026-02-10 10:52:00 +01:00
Ernest Zaslavsky
7142b1a08d exceptions: add helper to build a chain of error handlers
Generalize error handling by creating exception dispatcher which allows to write error handlers by sequentially applying handlers the same way one would write `catch ()` blocks
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
7fd62f042e http: extract error classification code
move http client related error classification code to a common location for future reuse
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
5beb7a2814 aws_error: extract retryable from aws_error
Move aws::retryable to common location to reuse it later in other http based clients
2026-02-09 08:48:41 +02:00
Pavel Emelyanov
83e64b516a hint: Don't switch group in database::apply_hint()
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.

To be sure, add a debugging check for current group being the expected one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
Pavel Emelyanov
727f1be11c hint_sender: Switch to sender group on stop either
Currently sender only switches group for hints sending on start. It's
worth doing the same on stop too for consistency. There's nothing to
compete with at this point.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
Pavel Emelyanov
44715a2d45 api: Remove map_to_key_value() helpers
All the callers had already been patched to stream their results
directly as json.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
dcbb5cb45b api: Streamify view_build_statuses handler
Similarly to previous patch, the handler can stream the map of build
statuses. Unlike previous patch, it doesn't need to fmt::format() key
and value, as these are strings already.

It could be a map_to_json<string, string> partial specialization, but
there's so far only one caller, so probably not worth it yet.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
73512a59ff api: Streamify few more storage_service/ handlers
Like get_token_endpoint one streams the map that it got from storage
service, the get_ownership and get_effective_ownership can do the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
a4bd9037b3 api: Add map_to_json() helper
The get_token_endpoint handler converts iterator of std::map into
generated maplist_mapper type. Next patch will do the same for more
handlers, so it's good to have a helper converter for it.

As a nice side effect, it's possible to avoid multiline lambda argument
to stream_range_as_array().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
63cafab56c api: Coroutinize view_build_statuses handler
Further patching will be nicer if this handler is a coroutine

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pawel Pery
81d11a23ce Revert "Merge 'vector_search: add validator tests' from Pawel Pery"
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.

There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.

It needs a backport to 2026.1 to remove remaining validator tests from there.

Fixes: VECTOR-497

Closes scylladb/scylladb#28568
2026-02-08 16:29:58 +02:00
Avi Kivity
bb99bfe815 test: scylla_gdb: tighten check for Error output from gdb
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact the task_histogram can include
the string 'error::Error' from the Rust core::error module.

Allow for that and only match 'Error' that isn't 'error::Error'.

Fixes #28516.

Closes scylladb/scylladb#28574
2026-02-08 09:48:23 +02:00
Pavel Emelyanov
1b1aae8a0d test: Merge test_restore_primary_replica_different_... tests
The difference is very small:

@@ -1,18 +1,18 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_different_...(manager: ManagerClient, object_storage):
     ''' Comment '''

-    topology = topo(rf = 2, nodes = 2, racks = 2, dcs = 1)
-    scope = "dc"
+    topology = topo(rf = 1, nodes = 2, racks = 1, dcs = 2)
+    scope = "all"
     ks = 'ks'
     cf = 'cf'

-    servers, host_ids = await create_cluster(topology, True, manager, logger, object_storage)
+    servers, host_ids = await create_cluster(topology, False, manager, logger, object_storage)

     await manager.disable_tablet_balancing()
     cql = manager.get_cql()
@@ -41,7 +41,6 @@ async def test_restore_primary_replica_d
         log = await manager.server_open_log(s.server_id)
         res = await log.grep(r'INFO.*sstables_loader - load_and_stream:.*target_node=([0-9a-z-]+),.*num_bytes_sent=([0-9]+)')
         streamed_to = set([ r[1].group(1) for r in res ])
-        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}')
+        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}, expected {servers}')
         assert len(streamed_to) == 2

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:14:43 +03:00
Pavel Emelyanov
70988f9b61 test: Merge test_restore_primary_replica_same_... tests
The difference is very tiny:

@@ -1,12 +1,12 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage):
     ''' comment '''

-    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1)
-    scope = "rack"
+    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2)
+    scope = "dc"
     ks = 'ks'
     cf = 'cf'

@@ -42,7 +42,7 @@ async def test_restore_primary_replica_s
         for r in res:
             nodes_by_operation[r[1].group(1)].append(r[1].group(2))

-        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ])
+        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ])
         for op, nodes in nodes_by_operation.items():
             logger.info(f'Operation {op} streamed to nodes {nodes}')
             assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica"

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:12:23 +03:00
Pavel Emelyanov
98b4092153 test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
If not specified, the call would use dcs * rf default, which match this
teat parameters perfectly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:47 +03:00
Pavel Emelyanov
1ac1e90b16 test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
It merely copies the `servers` one, no need in it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:22 +03:00
Anna Stuchlik
dc8f7c9d62 doc: replace the OS Support page with a link to the new location
We've moved that page to another place; see https://github.com/scylladb/scylladb/issues/28561.
This commit replaces the page with the link to the new location
and adds a redirection.

Fixes https://github.com/scylladb/scylladb/issues/28561

Closes scylladb/scylladb#28562
2026-02-06 11:38:21 +02:00
Tomasz Grabiec
41930c0176 tablets, config: Reduce migration concurrency to 2
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SStables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporary grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

The reduce impact from this, reduce concurrency of
migation. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317
2026-02-06 00:42:19 +01:00
Tomasz Grabiec
56e40e90c9 tablets: load_balancer: Always accept migration if the load is 0
Different transitions have different weights, and limits are
configurable.  We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.

For example, repair uses weight 2 for read concurrency.  Migrating
co-located tablets scales the cost by the number of co-located
tablets.
2026-02-06 00:42:18 +01:00
Tomasz Grabiec
39492596c2 config, tablets: Make tablet migration concurrency configurable
We're about to reduce it. It's better to not have it hard-coded in
case we change our mings again.
2026-02-06 00:42:18 +01:00
Avi Kivity
7a3ce5f91e test: minio: disable web console
minio starts a web console on a random port. This was seen to interfere
with the nodetool tests when the web console port clashed with the mock
API port.

Fix by disabling the web console.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-496

Closes scylladb/scylladb#28492
2026-02-05 20:11:32 +02:00
Nikos Dragazis
5d1e6243af test/cluster: Remove short_tablet_stats_refresh_interval injection
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.

This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).

Use the config option. This reduces the execution time to ~8 seconds.

Fixes SCYLLADB-556.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28536
2026-02-05 20:11:32 +02:00
Pavel Emelyanov
10c278fff7 database: Remove _flush_sg member from replica::database
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do the initialization with whatever value the
field itself is initialized and drop the field itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28539
2026-02-05 13:02:35 +02:00
Petr Hála
a04dbac369 open-coredump: Change to use new backtrace
* This is a breaking change, which removes compatibility with the old backtrace
    - See https://staging.backtrace.scylladb.com/api/docs#/default/search_by_build_id_search_build_id_post for the APIDoc
* Add timestamp field to log
* Tested locally

Closes scylladb/scylladb#28325
2026-02-05 11:50:47 +02:00
Marcin Maliszkiewicz
0753d9fae5 Merge 'test: remove xfail marker from a few passing tests' from Nadav Har'El
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.

Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).

Backport not needed, xpass doesn't bother anyone.

Closes scylladb/scylladb#28441

* github.com:scylladb/scylladb:
  test/cqlpy: remove xfail from tests for fixed issue 7972
  test/cqlpy: remove xfail from tests for fixed issue 10358
  test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
  test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
2026-02-05 10:10:43 +01:00
Marcin Maliszkiewicz
6eca74b7bb Merge 'More Alternator tests for BatchWriteItem' from Nadav Har'El
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.

* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes #28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.

Closes scylladb/scylladb#28528

* github.com:scylladb/scylladb:
  test/alternator: add test for BatchWriteItem with different write isolations
  test/alternator: reproducer for Alternator Streams bug
2026-02-05 10:07:29 +01:00
Yaron Kaikov
b30ecb72d5 ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543
2026-02-05 08:41:43 +02:00
Michał Hudobski
6b9fcc6ca3 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519
2026-02-04 09:10:08 +01:00
Nadav Har'El
47e827262f test/alternator: add test for BatchWriteItem with different write isolations
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.

But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-option option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.

In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.

Fixes #28171

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:24:29 +02:00
Nadav Har'El
c63f43975f test/alternator: reproducer for Alternator Streams bug
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of the following circumstances:

1. A BatchWriteItem operation writing multiple items to the *same*
   partition.

2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
   occur in other write isolation modes).

We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).

As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.

The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to checks the different write isolation mode - which don't
exist in AWS DynamoDB.

Refs #28439

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:17:48 +02:00
Radosław Cybulski
03ff091bee alternator: improve events output when test failed
Improve events printing, when test in test_streams.py failed.
New code will print both expected and received events (keys, previous
image, new image and type).
New code will explicitly mark, at which output event comparison failed.

Fixes #28455

Closes scylladb/scylladb#28476
2026-02-03 21:55:07 +02:00
Anna Stuchlik
a427ad3bf9 doc: remove the link to the Open Source blog post
Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28518
2026-02-03 14:15:16 +01:00
Botond Dénes
3adf8b58c4 Merge 'test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0' from Patryk Jędrzejczak
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.

Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.

Closes scylladb/scylladb#28464

* github.com:scylladb/scylladb:
  test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
  test: test_tablets_parallel_decommission: prevent group0 majority loss
  test: delete test_service_levels_work_during_recovery
2026-02-03 08:19:05 +02:00
Pavel Emelyanov
19ea05692c view_build_worker: Do not switch scheduling groups inside work_on_view_building_tasks
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code looks seemingly was copied from the
view_builder::do_build_step() method and got the explicit switch of the
scheduling group.

The switch looks both -- justified and not. On one hand, it makes it
explict that the activity runs in the streaming scheduling group. On the
other hand, the verb already uses RPC index on 1, which is negotiated to
be run in streaming group anyway. On the "third hand", even though being
explicit the switch happens too late, as there exists a lot of other
activities performed by the handler that seems to also belong to the
same scheduling group, but which is not switched into explicitly.

By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28397
2026-02-03 07:00:32 +02:00
Anna Stuchlik
77480c9d8f doc: fix the links on the repair-related pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28199.

This commit fixes the syntax of the internal links.

Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28487
2026-02-03 06:54:08 +02:00
Botond Dénes
64b38a2d0a Merge 'Use gossiper scheduling group where needed' from Pavel Emelyanov
This is the continuation of #28363 , this time about getting gossiper scheduling group via database.
Several places that do it already have gossiper at hand and should better get the group from it.
Eventually, this will allow to get rid of database::get_gossip_scheduling_group().

Refining inter-components API, not backporting

Closes scylladb/scylladb#28412

* github.com:scylladb/scylladb:
  gossiper: Export its scheduling group for those who need it
  migration_manager: Reorder members
2026-02-03 06:51:31 +02:00
Nadav Har'El
48b01e72fa test/alternator: add test verifying that keys only allow S/B/N type
Recently we had a question whether key columns can have any supported
type. I knew that actually - they can't, that key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.

We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28479
2026-02-03 06:49:02 +02:00
Andrei Chekun
ed9a96fdb7 test.py: modify logic for adding function_path in JUnit
Current way is checking only fail during the test phase, and it will
miss the cases when fail happens on another phase. This PR eliminate
this, so every phase will have modified node reporter to enrich the
JUnit XML report with custom attribute function_path.

Closes scylladb/scylladb#28462
2026-02-03 06:42:18 +02:00
Andrei Chekun
3a422e82b4 test.py: fix the file name in test summary
Current way is always assumed that the error happened in the test file,
but that not always true. This PR will show the error from the boost
logger where actually error is happened.

Closes scylladb/scylladb#28429
2026-02-03 06:38:21 +02:00
Benny Halevy
84caa94340 gossiper: add_expire_time_for_endpoint: replace fmt::localtime with gmtime in log printout
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone so it'd unclear
   to anyone reading the lof message if the expiration time is in the
   local timezone or in GMT/UTC.

Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
 2428 |             endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
      |                            ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
  538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
      | ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
  207 | #  define FMT_DEPRECATED [[deprecated]]
      |                            ^
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#28434
2026-02-03 06:36:53 +02:00
Pavel Emelyanov
8c42704c72 storage_service: Check raft rpc scheduling group from debug namespace
Some storage_service rpc verbs may checks that a handler is executed
inside gossiper scheduling group. For that, the expected group is
grabbed from database.

This patch puts the gossiper sched group into debug namespace and makes
this check use it from there. It removes one more place that uses
database as config provider.

Refs #28410

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28427
2026-02-03 06:34:03 +02:00
Asias He
b5c3587588 repair: Add request type in the tablet repair log
So we can know if the repair is an auto repair or a user repair.

Fixes SCYLLADB-395

Closes scylladb/scylladb#28425
2026-02-03 06:26:58 +02:00
Nadav Har'El
a63ad48b0f test/cqlpy: remove xfail from tests for fixed issue 7972
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:32 +02:00
Nadav Har'El
10b81c1e97 test/cqlpy: remove xfail from tests for fixed issue 10358
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
508bb97089 test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
2026-02-02 23:49:31 +02:00
Nadav Har'El
3682c06157 test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.

The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.

Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this pass, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.

Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't the "the next PR" even if merged today.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
df69dbec2a Merge ' cql3/statements/describe_statement: hide paxos state tables ' from Michał Jadwiszczak
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

Closes scylladb/scylladb#28230

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-02 21:22:59 +02:00
Nadav Har'El
f23e796e76 alternator: fix typos in comments and variable names
Copilot found these typos in comments and variable name in alternator/,
so might as well fix them.

There are no functional changes in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28447
2026-02-02 19:16:43 +03:00
Marcin Maliszkiewicz
88c4ca3697 Merge 'test: migrate guardrails_test.py from scylla-dtest' from Andrzej Jackowski
This patch series copies `guardrails_test.py` from scylla-dtest, fix it and enables it.

The motivation is to unify the test execution of guardrails test, as some tests (`cqlpy/test_guardrail_...`) were already in scylladb repo, and some were in `scylla-dtest`.

Fixes: SCYLLADB-255

No backport, just test migration

Closes scylladb/scylladb#28454

* github.com:scylladb/scylladb:
  test: refactor test_all_rf_limits in guardrails_test.py
  test: specify exceptions being caught in guardrails_test.py
  test: enable guardrails_test.py
  test: add wait_other_notice to test_default_rf in guardrails_test.py
  test: copy guardrails_test.py from scylla-dtest
2026-02-02 16:54:13 +01:00
Avi Kivity
acc54cf304 tools: toolchain: adapt future toolchain to loss of toxiproxy in Fedora
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954

Closes scylladb/scylladb#28444
2026-02-02 17:02:19 +02:00
Avi Kivity
419636ca8f test: ldap: regularize toxiproxy command-line options
Modern toxiproxy interprets `-h` as help and requires the subcommand
subject (e.g. the proxy name) to be after the subcommand switches.
Arrange the command line in the way it likes, and spell out the
subcommands to be more comprehensible.

Closes scylladb/scylladb#28442
2026-02-02 17:00:58 +02:00
Botond Dénes
2b3f3d9ba7 Merge 'test.py: support boost labels in test.py' from Artsiom Mishuta
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes test.py logic of parsing boost test cases to use -- --list_json_content
and pass boost labels as pytests markers

using  -- --list_json_content is not ideal and currenly require to implement severall [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it. because now we can apply the tiering mechanism for the boost tests as well

Fixes SCYLLADB-246

Closes scylladb/scylladb#28232

* github.com:scylladb/scylladb:
  test: add nightly label
  test.py: support boost labels in test.py
2026-02-02 16:55:29 +02:00
Dawid Mędrek
68981cc90b Merge 'raft topology: generate notification about released nodes only once' from Piotr Dulikowski
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

Closes scylladb/scylladb#28367

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-02 15:39:15 +01:00
Jenkins Promoter
c907fc6789 Update pgo profiles - aarch64 2026-02-02 14:56:49 +02:00
Dawid Mędrek
b0afd3aa63 Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

Closes scylladb/scylladb#28322

* github.com:scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-02 13:28:19 +01:00
Andrzej Jackowski
298aca7da8 test: refactor test_all_rf_limits in guardrails_test.py
Before this commit, `test_all_rf_limits` was implemented in a
repetitive manner, making it harder to understand how the guardrails
were tested. This commit refactors the test to reduce code redundancy
and verify the guardrails more explicitly.
2026-02-02 10:49:12 +01:00
Andrzej Jackowski
136db260ca test: specify exceptions being caught in guardrails_test.py
Before this commit, the test caught a broad `Exception`. This change
specifies the expected exceptions to avoid a situation where the product
or test is broken and it goes undetected.
2026-02-02 10:48:07 +01:00
Patryk Jędrzejczak
ec2f99b3d1 test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
1f28a55448 test: test_tablets_parallel_decommission: prevent group0 majority loss
Both of the changed test cases stop two out of four nodes when there are
three group0 voters in the cluster. If one of the two live nodes is
a non-voter (node 1, specifically, as node 0 is the leader), a temporary
majority loss occurs, which can cause the following operations to fail.
In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node`
API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`,
removenode can fail with an unexpected error message:
```
"service::raft_operation_timeout_error (group
[46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)"
```

Somehow, these test cases are currently not flaky, but they become flaky in
the following commit.

We can consider backporting this commit to 2026.1 to prevent flakiness.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
bcf0114e90 test: delete test_service_levels_work_during_recovery
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
2026-02-02 10:39:54 +01:00
Artsiom Mishuta
af2d7a146f test: add nightly label
add nightly label for test
test_foreign_reader_as_mutation_source
as an example of usinf boost labels pytest as markers

command to test :
./tools/toolchain/dbuild  pytest --test-py-init --collect-only -q -m=nightly test/boost

output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
2026-02-02 10:30:38 +01:00
Gleb Natapov
08268eee3f topology: disable force-gossip-topology-changes option
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed if the tests can be fixed to use raft mode instead.

Closes scylladb/scylladb#28383
2026-02-02 09:56:32 +01:00
Avi Kivity
ceec703bb7 Revert "main: test: add future and abort_source to after_init_func"
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr` which now crashes on uninitialized
`c.as`.

Fixes #28473

Closes scylladb/scylladb#28478
2026-02-02 09:22:24 +01:00
Avi Kivity
cc03f5c89d cql3: support literals and bind variables in selectors
Add support for literals in the SELECT clause. This allows
SELECT fn(column, 4) or SELECT fn(column, ?).

Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still
not accepted because of failed type inference - we cannot infer the
type of 7, and don't have a favored type for literals (like C favors
int). We might relax this later.

In the WHERE clause, and Cassandra in the SELECT clause, type hints
can also resolve type ambiguity: (bigint)7 or (text)?. But this is
deferred to a later patch.

A few changes to the grammar are needed on top of adding a `value`
alternative to `unaliasedSelector`:

 - vectorSimilarityArg gained access to `value` via `unaliasedSelector`,
   so it loses that alternate to avoid ambiguity. We may drop
   `vectorSimilarityArg` later.
 - COUNT(1) became ambiguous via the general function path (since
   function arguments can now be literals), so we remove this case
   from the COUNT special cases, remaining with count(*).
 - SELECT JSON and SELECT DISTINCT became "ambiguous enough" for
   ANTLR to complain, though as far as I can tell `value` does not
   add real ambiguity. The solution is to commit early (via "=>") to
   a parsing path.

Due to the loss of count(1) recognition in the parser, we have to
special-case it in prepare. We may relax it to count any expression
later, like modern Cassandra and SQL.

Testing is awkward because of the type inference problem in top-level.
We test via the set_intersection() function and via lua functions.

Example:

```
cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int  LANGUAGE LUA AS 'return a + b';
cqlsh> SELECT ks.sum(1, 2) FROM system.local;

 ks.sum(1, 2)
--------------
            3

(1 rows)
cqlsh>
```

(There are no suitable system functions!)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296

Closes scylladb/scylladb#28256
2026-02-02 00:06:13 +02:00
Patryk Jędrzejczak
68b105b21c db: virtual tables: add the rack column to cluster_status
`system.cluster_status` is missing the rack info compared to `nodetool status`
that is supposed to be equivalent. It has probably been an omission.

Closes scylladb/scylladb#28457
2026-02-01 20:36:53 +01:00
Pavel Emelyanov
6f3f30ee07 storage_service: Use stream_manager group for streaming
The hander of raft_topology_cmd::command::stream_ranges switches to
streaming scheduling group to perform data streaming in it. It grabs the
group from database db_config, which's not great. There's streaming
manager at hand in storage service handlers, since it's using its
functionality, it should use _its_ scheduling group.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28363
2026-02-01 20:42:37 +02:00
Marcin Maliszkiewicz
b8c75673c8 main: remove confusing duplicated auth start message
Before we could observe two exactly the same
"starting auth service" messages in the log.
One from checkpoint() the other from notify().
We remove the second one to stay consistent
with other services.

Closes scylladb/scylladb#28349
2026-02-01 13:57:53 +02:00
Avi Kivity
6676953555 Merge 'test: perf: add option to write results to json in perf-cql-raw and perf-alternator' from Marcin Maliszkiewicz
Adds --json-result option to perf-cql-raw and perf-alternator, the same as perf-simple-query has.
It is useful for automating test runs.

Related: https://scylladb.atlassian.net/browse/SCYLLADB-434

Bacport: no, original benchmark is not backported

Closes scylladb/scylladb#28451

* github.com:scylladb/scylladb:
  test: perf: add example commands to perf-alternator and perf-cql-raw
  test: perf: add option to write results to json in perf-cql-raw
  test: perf: add option to write results to json in perf-alternator
  test: perf: move write_json_result to a common file
2026-02-01 13:57:10 +02:00
Artsiom Mishuta
e216504113 test.py: support boost labels in test.py
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes test.py logic of parsing boost test cases to use -- --list_json_content
and pass boost labels as pytests markers

fixes: https://github.com/scylladb/scylladb/issues/25415
2026-02-01 11:31:26 +01:00
Tomasz Grabiec
b93472d595 Merge 'load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Ferenc Szili
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

Closes scylladb/scylladb#28440

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-01-31 21:12:19 +01:00
Ferenc Szili
92dbde54a5 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
2026-01-30 15:11:29 +01:00
Patryk Jędrzejczak
7e7b9977c5 test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
6c547e1692 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
867a1ca346 test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
53f58b85b7 test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
408c6ea3ee test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
c92962ca45 test: test_maintenance_mode: remove the redundant value from the query result 2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
9d4a5ade08 storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
a08c53ae4b storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988
2026-01-30 12:55:16 +01:00
Andrzej Jackowski
625f292417 test: enable guardrails_test.py
After guardrails_test.py has been migrated to test.py and fixed in
previous commits of this patch series, it can finally be enabled.

Fixes: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
576ad29ddb test: add wait_other_notice to test_default_rf in guardrails_test.py
This commit adds `wait_other_notice=True` to `cluster.populate` in
`guardrails_test.py`. Without this, `test_default_rf` sometimes fails
because `NetworkTopologyStrategy` setting fails before
the node knows about all other DCs.

Refs: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
64c774c23a test: copy guardrails_test.py from scylla-dtest
This commit copies guardrails_test.py from dtest repository and
(temporarily) disables it, as it requires improvement in following
commits of this patch series before being enabled.

Refs: SCYLLADB-255
2026-01-30 11:51:40 +01:00
Marcin Maliszkiewicz
e18b519692 cql3: remove find_schema call from select check_access
Schema is already a member of select statement, avoiding
the call saves around 400 cpu instructions on a select
request hot path.

Closes scylladb/scylladb#28328
2026-01-30 11:49:09 +01:00
Ferenc Szili
71be10b8d6 load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.
2026-01-30 09:48:59 +01:00
Marcin Maliszkiewicz
80e627c64b test: perf: add example commands to perf-alternator and perf-cql-raw 2026-01-30 08:48:19 +01:00
Pawel Pery
f49c9e896a vector_search: allow full secondary indexes syntax while creating the vector index
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.

This patch allows creating vector indexes using this CQL syntax:

```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  discussion_board_id int,
  country text,
  lang text,
  PRIMARY KEY ((commenter, discussion_board_id), created_at)
);

CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
  ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
  ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Currently, if we run these queries to create indexes we will receive such errors:

```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```

This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.

Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).

This commits adds cqlpy test to check errors while creating indexes.

Fixes: SCYLLADB-298

This needs to be backported to version 2026.1 as this is a fix for filtering support.

Closes scylladb/scylladb#28366
2026-01-30 01:14:31 +02:00
Avi Kivity
3d1558be7e test: remove xfail markers from SELECT JSON count(*) tests
These were marked xfail due to #8077 (the column name was wrong),
but it was fixed long ago for 5.4 (exact commit not known).

Remove the xfail markers to prevent regressions.

Closes scylladb/scylladb#28432
2026-01-29 21:56:00 +02:00
Piotr Dulikowski
f150629948 Merge 'auth: switch find_record to use cache' from Marcin Maliszkiewicz
This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard.

Backport: no, it's not a bug fix.

Closes scylladb/scylladb#28329

* github.com:scylladb/scylladb:
  auth: bring back previous version of standard_role_manager::can_login
  auth: switch find_record to use cache
  auth: make find_record and callers standard_role_manager members
2026-01-29 17:25:42 +01:00
Avi Kivity
7984925059 Merge 'Use coroutine::switch_to() in table::try_flush_memtable_to_sstable' from Pavel Emelyanov
The method was coroutinized by 6df07f7ff7. Back then thecoroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve the capture list lifetime problems. As a result, the code looks like

```
auto flush = [] {
    ... // do the flushing
    auto post_flush = [] {
        ... // do the post-flushing
    }
    co_return co_await with_scheduling_group(group_b, post_flush);
};
co_return co_await with_scheduling_group(group_a, flush);
```

which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this

```
co_await switch_to(group_a);
... // do the flushing
co_await switch_to(group_b);
... // do the post-flushing
```

Code cleanup, not backporting

Closes scylladb/scylladb#28430

* github.com:scylladb/scylladb:
  table: Fix indentation after previous patch
  table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
2026-01-29 18:12:35 +02:00
Nadav Har'El
a6fdda86b5 Merge 'test: test_alternator_proxy_protocol: fix race between node startup and test start' from Avi Kivity
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.

Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.

Fixes #28210
Fixes #28211

Flaky tests are only in master and branch-2026.1, so backporting there.

Closes scylladb/scylladb#28291

* github.com:scylladb/scylladb:
  test: test_alternator_proxy_protocol: wait for the node to report itself as serving
  test: cluster_manager: add ability to wait for supervisor STATUS=serving
2026-01-29 16:18:26 +02:00
Pavel Emelyanov
56e212ea8d table: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-29 15:02:33 +03:00
Pavel Emelyanov
258a1a03e3 table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-29 15:01:27 +03:00
Marcin Maliszkiewicz
ea29e4963e test: perf: add option to write results to json in perf-cql-raw 2026-01-29 10:56:03 +01:00
Marcin Maliszkiewicz
d974ee1e21 test: perf: add option to write results to json in perf-alternator 2026-01-29 10:55:52 +01:00
Marcin Maliszkiewicz
a74b442c65 test: perf: move write_json_result to a common file
The implementation is going to be shared with
perf-alternator and perf-cql-raw.
2026-01-29 10:54:11 +01:00
Botond Dénes
3158e9b017 doc: reorganize properties in config.cc and config.hh
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.

This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.

Closes scylladb/scylladb#28343
2026-01-29 11:27:42 +03:00
Pavel Emelyanov
937d008d3c Merge 'Clean up partition_snapshot_reader' from Botond Dénes
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.

Code cleanup, no backport

Closes scylladb/scylladb#28353

* github.com:scylladb/scylladb:
  replica/partition_snapshot_reader: remove unused includes
  partition_snapshot_reader: remove "flat" from name
  mv partition_snapshot_reader.hh -> replica/
2026-01-29 11:22:15 +03:00
Botond Dénes
f6d7f606aa memtable_test: disable flushing_rate_is_reduced_if_compaction_doesnt_keep_up for debug
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test, there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode, it is not
expected to catch any use-after-free and such.

Closes scylladb/scylladb#28407
2026-01-29 11:13:22 +03:00
Jakub Smolar
e978cc2a80 scylla_gdb: use persistent GDB - decrease test execution time
This commit replaces the previous approach of running pytest inside
GDB’s Python interpreter. Instead, tests are executed by driving a
persistent GDB process externally using pexpect.

- pexpect: Python library for controlling interactive programs
  (used here to send commands to GDB and capture its output)
- persistent GDB: keep one GDB session alive across multiple tests
  instead of starting a new process for each test

Tests can now be executed via `./test.py gdb` or with
`pytest test/scylla_gdb`. This improves performance and
makes failures easier to debug since pytest no longer runs
hidden inside GDB subprocesses.

Closes scylladb/scylladb#24804
2026-01-29 10:01:39 +02:00
Avi Kivity
347c69b7e2 build: add clang-tools-extra (for clang-include-cleaner) to frozen toolchain
clang-include-cleaner is used in the iwyu.yaml github workflow (include-
what-you-use). Add it to the frozen toolchain so it can be made part
of the regular build process.

The corresponding install command is removed from iwyu.yaml.

Regenerated frozen toolchain with optimized clang from

  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz

Closes scylladb/scylladb#28413
2026-01-29 08:44:49 +02:00
Botond Dénes
482ffe06fd Merge 'Improve load shedding on the replica side' from Łukasz Paszkowski
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.

Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.

If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criteria:

if read's remaining time <= read's timeout when arrived to the semaphore * live updateable preemptive_abort_factor;
the read is rejected and the next one from the wait list is considered.

Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353

Backport is not needed. Better to first observe its impact.

Closes scylladb/scylladb#21649

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: Check during admission if read may timeout
  permit_reader::impl: Replace break with return after evicting inactive permit on timeout
  reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
  config: Add parameters to control reads' preemptive_abort_factor
  permit_reader: Add a new state: preemptive_aborted
  reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
  reader_concurrency_semaphore: Remove cpu_concurrency's default value
2026-01-29 08:27:22 +02:00
Botond Dénes
a8767f36da Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

Closes scylladb/scylladb#28337

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-29 08:25:17 +02:00
Amnon Heiman
f2e142ac6e test/boost/estimated_histogram_test.cc: Switch to real Sum
Now that the sum function in the histogram uses true values instead of
an estimate, the test should reflect that.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 23:19:00 +02:00
Piotr Dulikowski
ec6a2661de Merge 'Keep view_builder background fiber in maintenance scheduling group' from Pavel Emelyanov
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.

This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).

New feature (nested scheduling group), not backporting

Closes scylladb/scylladb#28386

* github.com:scylladb/scylladb:
  view_builder: Start background in maintenance group
  view_builder: Wake-up step fiber with condition variable
2026-01-28 20:49:19 +01:00
Pavel Emelyanov
cb1d05d65a streaming: Get streaming sched group from debug:: namespace
In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.

The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.

To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28410
2026-01-28 19:14:59 +02:00
Marcin Maliszkiewicz
5d4e2ec522 Merge 'docs: add documentation for automatic repair' from Botond Dénes
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.

Fixes: SCYLLADB-130

This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.

Closes scylladb/scylladb#28199

* github.com:scylladb/scylladb:
  docs: add feature page for automatic repair
  docs: inter-link incremental-repair and repair documents
  docs: incremental-repair: fix curl example
2026-01-28 17:46:53 +01:00
Nadav Har'El
1454228a05 test/cqlpy: fix "assertion rewriting" in translated Cassandra tests
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.

But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.

This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.

We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:

    pytest.register_assert_rewrite("cassandra_tests.porting")

This tells pytest to rewrite assertions in porting.py the first time
that it is imported.

It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.

I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28411
2026-01-28 18:34:57 +02:00
Pavel Emelyanov
3ebd02513a view_builder: Start background in maintenance group
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.

This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.

Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:59 +03:00
Pavel Emelyanov
2439d27b60 view_builder: Wake-up step fiber with condition variable
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.

This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:58 +03:00
Pavel Emelyanov
5ce12f2404 gossiper: Export its scheduling group for those who need it
There are several places in the code that need to explicitly switch into
gossiper scheduling group. For that they currently call database to
provide the group, but it's better to get gossiper sched group from
gossiper itself, all the more so all those places have gossiper at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Pavel Emelyanov
0da1a222fc migration_manager: Reorder members
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Botond Dénes
1713d75c0d docs: add feature page for automatic repair
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
2026-01-28 16:45:57 +02:00
Łukasz Paszkowski
7e1bbbd937 reader_concurrency_semaphore: Check during admission if read may timeout
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.

When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.

Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.

If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following cryteria:

if read's remaining time <= read's timeout when arrived to the semaphore * preemptive factor;
the read is rejected and the next one from the wait list is
considered.
2026-01-28 14:24:45 +01:00
Łukasz Paszkowski
8a613960af permit_reader::impl: Replace break with return after evicting inactive permit on timeout
Evicting an inactive permit destroyes the permit object when the
reader is closed, making any further member access invalid. Switch
from break to an early return to prevent any possible use-after-free
after evict() in the state::inactive timeout path.
2026-01-28 14:24:33 +01:00
Łukasz Paszkowski
fde09fd136 reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
  + 0.0 means a read will never get rejected during admission
  + 1.0 means a read will immediatelly get rejected during admission

Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
21348050e8 config: Add parameters to control reads' preemptive_abort_factor 2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
2d3a40e023 permit_reader: Add a new state: preemptive_aborted
A permit gets into the preemptive_aborted state when:
- times out;
- gets rejected from execution due to high chance its execution would
  not finalize on time;

Being in this state means a permit was removed from the wait list,
its internal timer was canceled and semaphore's statistic
`total_reads_shed_due_to_overload` increased.
2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
5a7cea00d0 reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
Add a defensive check in dequeue_permit() to avoid underflowing
_stats.waiters and report an internal error if the stats are already
inconsistent.
2026-01-28 14:19:53 +01:00
Tomasz Grabiec
df949dc506 Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.

To make cleanup reliable this PR:

Adds an additional “fallback cleanup” stage

- `write_both_read_old_fallback_cleanup`

that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.

Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.

Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.

No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.

Closes scylladb/scylladb#28169

* github.com:scylladb/scylladb:
  topology_coordinator: add write_both_read_old_fallback_cleanup state
  topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
  topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
  topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
2026-01-28 13:33:39 +01:00
Amnon Heiman
3175540e87 transport/server: to bytes_histogram
This patch replaces simple counters with bytes_histogram for tracking
CQL request and response sizes, enabling better visibility into message
size distribution.

Changes:
- Replace request_size and response_size metrics with bytes_histogram in
  cql_sg_stats::request_kind_stats
- Per-shard metrics continue to be reported as before
- QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group
  histograms of bytes sent and received, providing detailed insight into these
  operations

Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node
histogram reporting as they are less performance-critical, but can be added
in the future if proven useful.

Metrics example:

```
 # HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind.
 # TYPE scylla_transport_cql_request_bytes counter
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
 # HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group.
 # TYPE scylla_transport_cql_request_histogram_bytes histogram
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
2026-01-28 13:53:47 +02:00
Amnon Heiman
5875bcca23 approx_exponential_histogram: Add sum() method for accurate value tracking
Previously, histogram sums were estimated by multiplying bucket offsets
by their counts, which produces inaccurate results - typically too high
when using upper limits or too low when using lower limits.

This patch adds accurate sum tracking to approx_exponential_histogram:
- Adds a _sum member variable to track the actual sum of all values
- Implements sum() method to return the accumulated total
- Updates add() to increment _sum for each value
- Modifies to_metrics_histogram() helper to use the new sum() method

This change is important as histograms will be used instead of counters for
byte statistics, where accurate totals are essential for metrics reporting.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:39:46 +02:00
Amnon Heiman
2fd453f4ec utils/estimated_histogram.hh: Add bytes_histogram
For various use cases, we need to report byte histograms, such as for
request and reply message sizes.

This patch introduce bytes_histogram as a type alias for
approx_exponential_histogram configured to track byte values from 1KB to
1GB with power-of-2 buckets (Precision=1).

This provides a convenient, performance-efficient histogram for
measuring message sizes, payload sizes, and other byte-based metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:31:39 +02:00
Botond Dénes
ee631f31a0 Merge 'Do not export system keyspace from raft_group0_client' from Pavel Emelyanov
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.

Refining API between components, not backporting

Closes scylladb/scylladb#28387

* github.com:scylladb/scylladb:
  raft_group0_client: Dont export system keyspace
  raft_group0_client: Add and use get_last_group0_state_id()
  group0_state_machine: Call ensure_group0_sched() with data_dictionary
  view_building_worker: Use its own system_keyspace& reference
2026-01-28 13:24:32 +02:00
Yaron Kaikov
7c49711906 test/cqlpy: Remove redundant pytest.register_assert_rewrite call
During test.py run, noticed this warning:
```
10:38:22  test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22    /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22      pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```

The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.

Closes scylladb/scylladb#28409
2026-01-28 13:17:05 +02:00
Avi Kivity
42fdea7410 github: fix iwyu workflow permissions
The include-what-you-use workflow fails with

```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```

Fix by adding the correct permissions.

Closes scylladb/scylladb#28390
2026-01-28 12:38:54 +02:00
Jakub Smolar
e1f623dd69 skip_mode: Allow multiple build modes in pytest skip_mode marker
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.

Before:
  @pytest.mark.skip_mode("dev", reason="...")
  @pytest.mark.skip_mode("debug", reason="...")

After:
  @pytest.mark.skip_mode(["dev", "debug"], reason="...")

This reduces duplication when the same skip condition applies
to multiple build modes.

Closes scylladb/scylladb#28406
2026-01-28 12:27:41 +02:00
Patryk Jędrzejczak
a2c1569e04 test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388
2026-01-28 10:54:22 +02:00
Avi Kivity
8d2689d1b5 build: avoid sccache by default for Rust targets
A bug[1] in sccache prevents correct distributed compilation of wasmtime.

Disable it by default for now, but allow users to enable it.

[1] https://github.com/mozilla/sccache/issues/2575

Closes scylladb/scylladb#28389
2026-01-28 10:36:49 +02:00
Pavel Emelyanov
2ffe5b7d80 tablet_allocator: Have its own explicit background scheduling group
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.

This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28356
2026-01-28 10:34:28 +02:00
Avi Kivity
47315c63dc treewide: include Seastar headers with angle brackets
Seastar is a "system" library from our point of view, so
should be included with angle brackets.

Closes scylladb/scylladb#28395
2026-01-28 10:33:06 +02:00
Botond Dénes
b7dccdbe93 Merge 'test/storage: speed up out-of-space prevention tests' from Łukasz Paszkowski
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.

Improvements. No backup is required.

Closes scylladb/scylladb#28396

* github.com:scylladb/scylladb:
  test/storage: speed up out-of-space prevention tests by using smaller volumes
  test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
2026-01-28 10:28:20 +02:00
Marcin Maliszkiewicz
931a38de6e service: remove unused has_schema_access
It became unused after we dropped support for thrift
in ad649be1bf

Closes scylladb/scylladb#28341
2026-01-28 10:18:26 +02:00
Pavel Emelyanov
834921251b test: Replace memory_data_source with seastar::util::as_input_stream
The existing test-only implementation is a simplified version of the
generic one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28339
2026-01-28 10:15:03 +02:00
Andrei Chekun
335e81cdf7 test.py: migrate nodetool to run by pytest
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.

Closes scylladb/scylladb#28348
2026-01-28 09:49:59 +02:00
Tomasz Grabiec
8e831a7b6d tablets: tablet_allocator.cc: Convert tabs to spaces 2026-01-28 01:32:01 +01:00
Tomasz Grabiec
9715965d0c tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
ef0e9ad34a tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
4a161bff2d tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
7228bd1502 tablets: load_balancer: Extract print_node_stats() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
615b86e88b tablet: load_balancer: Use empty() instead of size() where applicable 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
12fdd205d6 tablets: Fix redundancy in migration_plan::empty() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
0d090aa47b tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
f2b0146f0f tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
df32318f66 tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.
2026-01-28 01:32:00 +01:00
Emil Maskovsky
834961c308 db/view: add missing include for coroutine::all to fix build without precompiled headers
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.

The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.

Adding the explicit include fixes the build.

Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093

No backport: This problem is only present in master.

Closes scylladb/scylladb#28379
2026-01-27 18:56:56 +01:00
Calle Wilund
87aa6c8387 utils/gcp/object_storage: URL-encode object names in URL:s
Fixes #28398

When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.

Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.
2026-01-27 18:01:21 +01:00
Calle Wilund
a896d8d5e3 utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.

Need to handle.
2026-01-27 17:57:17 +01:00
Pavel Emelyanov
02af292869 Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

Closes scylladb/scylladb#27891

* github.com:scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-27 18:45:43 +03:00
Avi Kivity
59f2a3ce72 test: test_alternator_proxy_protocol: wait for the node to report itself as serving
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
2026-01-27 17:25:59 +02:00
Avi Kivity
ebac810c4e test: cluster_manager: add ability to wait for supervisor STATUS=serving
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
2026-01-27 17:24:55 +02:00
Botond Dénes
7ac32097da docs/cql/ddl.rst: Tombstones GC: explain propagation delay
This parameter was not mentioned at all anywhere in the documentation.
Add an explanation of this parameter: why we need it, what is the
default and how it can be changed.

Closes scylladb/scylladb#28132
2026-01-27 16:05:52 +01:00
Tomasz Grabiec
32b336e062 tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.
2026-01-27 16:01:36 +01:00
Piotr Dulikowski
29da20744a storage_service: fix indentation after previous patch 2026-01-27 15:49:01 +01:00
Piotr Dulikowski
d28c841fa9 raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.

Fixes: #28301
Refs: #25031
2026-01-27 15:48:05 +01:00
Łukasz Paszkowski
8829098e90 reader_concurrency_semaphore: Remove cpu_concurrency's default value
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.

The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
2026-01-27 15:40:11 +01:00
Łukasz Paszkowski
3ef594f9eb test/storage: speed up out-of-space prevention tests by using smaller volumes
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.

Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.

This cuts ~5–6s per test.
2026-01-27 15:28:59 +01:00
Łukasz Paszkowski
0f86fc680c test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
2026-01-27 15:28:59 +01:00
Piotr Dulikowski
10e9672852 raft topology: extract "released" nodes calculation to external function
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
2026-01-27 14:37:43 +01:00
Pavel Emelyanov
87920d16d8 raft_group0_client: Dont export system keyspace
Now system_keyspace reference is used internally by the client code
itself, no need to encourage other services abuse it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:51:40 +03:00
Pavel Emelyanov
966119ce30 raft_group0_client: Add and use get_last_group0_state_id()
There are several places that want to get last state id and for that
they make raft_group0_client() export system_keyspace reference.

This patch adds a helper method to provide the needed ID.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:50:25 +03:00
Pavel Emelyanov
dded1feeb7 group0_state_machine: Call ensure_group0_sched() with data_dictionary
There's a validation for tables being used by group0 commands are marked
with the respective prop. For it the caller code needs to provide
database reference and it gets one from client -> system_keyspace chain.

There's more explicit way -- get the data_dictionary via proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:48:22 +03:00
Pavel Emelyanov
20a2b944df view_building_worker: Use its own system_keyspace& reference
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:46:48 +03:00
Avi Kivity
16b56c2451 Merge 'Audit: avoid dynamic_cast on a hot path' from Marcin Maliszkiewicz
This patch set eliminates special audit info guard used before for batch statements
and simplifies audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.

Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug

Closes scylladb/scylladb#28326

* github.com:scylladb/scylladb:
  audit: replace batch dynamic_cast with static_cast
  audit: eliminate dynamic_cast to batch_statement in inspect
  audit: cql: remove create_no_audit_info
  audit: add batch bool to audit_info class
2026-01-27 12:54:16 +02:00
Pavel Emelyanov
c61d855250 hints: Provide explicit scheduling group for hint_sender
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.

This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28358
2026-01-27 12:50:11 +02:00
Gleb Natapov
9daa109d2c test: get rid of consistent_cluster_management usage in test
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.

Closes scylladb/scylladb#28340
2026-01-27 11:31:30 +01:00
Avi Kivity
fa5ed619e8 Merge 'test: perf: add perf-cql-raw benchmarking tool' from Marcin Maliszkiewicz
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull to assess tool's overhead or use it as bigger scale benchmark
- multi table mode
- non superuser mode

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

> Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
        mean=   99700.92 standard-deviation=3039.28
        median= 100409.60 median-absolute-deviation=2759.85
        maximum=102487.87 minimum=95707.93
instructions_per_op:
        mean=   109256.53 standard-deviation=2281.39
        median= 108337.37 median-absolute-deviation=1034.83
        maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
        mean=   76184.36 standard-deviation=2493.46
        median= 75320.20 median-absolute-deviation=922.09
        maximum=80572.19 minimum=74320.00

Backports: no, new tool

Closes scylladb/scylladb#25990

* github.com:scylladb/scylladb:
  test: perf: reuse stream id
  main: test: add future and abort_source to after_init_func
  test: perf: add option to stress multiple tables in perf-cql-raw
  test: perf: add perf-cql-raw benchmarking tool
  test: perf: move cut_arg helper func to common code
2026-01-27 12:23:25 +02:00
Yaron Kaikov
3f10f44232 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374
2026-01-27 10:59:21 +02:00
Avi Kivity
f1c6094150 Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.

This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.

Enanching testing code, not backporting

Closes scylladb/scylladb#28352

* github.com:scylladb/scylladb:
  code: Move limiting data source to test/lib
  util: Simplify limiting_data_source API
  util: Remove buffer_input_stream
  test: Use seastar::util::temporary_buffer_data_source in data consumer test
  sstables: Use seastar::util::as_input_stream() in mx reader
2026-01-26 22:05:59 +02:00
Raphael S. Carvalho
0e07c6556d test: Remove useless compaction group testing in database_test
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.

I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28324
2026-01-26 19:16:27 +02:00
Marcin Maliszkiewicz
19af46d83a audit: replace batch dynamic_cast with static_cast
Since we know already it's a batch we can use static
cast now.
2026-01-26 18:14:38 +01:00
Anna Stuchlik
edc291961b doc: update the GPG keys
Update the keys in the installation instructions (linux packages).

Fixes https://github.com/scylladb/scylladb/issues/28330

Closes scylladb/scylladb#28357
2026-01-26 19:13:18 +02:00
Piotr Dulikowski
5d5e829107 Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.

The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.

As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250

backport: not required, this is a refactor

Closes scylladb/scylladb#28093

* github.com:scylladb/scylladb:
  db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
  db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
  db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
  db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
  db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
2026-01-26 17:16:01 +01:00
Marcin Maliszkiewicz
55d246ce76 auth: bring back previous version of standard_role_manager::can_login
Previously, we wanted to make minimal changes with regards to the new
unified auth cache. However, as a result, some calls on the hot path
were missed. Now we have switched the underlying find_record call
to use the cache. Since caching is now at a lower level, we bring
back the original code.
2026-01-26 16:04:11 +01:00
Marcin Maliszkiewicz
3483452d9f auth: switch find_record to use cache
Since every write-type auth statement takes group0_guard at the beginning,
we hold read_apply_mutex and cannot have a running raft apply during our
operation. Therefore, the auth cache and internal CQL reads return the same,
consistent results. This makes it safe to read via cache instead of internal
CQL.

LDAP is an exception, but it is eventually consistent anyway.
2026-01-26 16:04:11 +01:00
Botond Dénes
755e8319ee replica/partition_snapshot_reader: remove unused includes 2026-01-26 16:52:46 +02:00
Botond Dénes
756837c5b4 partition_snapshot_reader: remove "flat" from name
The "flat" migration is long done, this distinction is no longer
meaningful.
2026-01-26 16:52:46 +02:00
Botond Dénes
9d1933492a mv partition_snapshot_reader.hh -> replica/
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.

Also move the code to the replica namespace.

test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
2026-01-26 16:52:08 +02:00
Avi Kivity
32cc593558 Merge 'tools/scylla-sstable: introduce filter command' from Botond Dénes
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.

Fixes: #13076

New functionality, no backport.

Closes scylladb/scylladb#27836

* github.com:scylladb/scylladb:
  tools/scylla-sstable: introduce filter command
  tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
  tools/scylla-sstable: make partition_set ordered
  tools/scylla-stable: remove unused boost/algorithm/string.hpp include
2026-01-26 16:32:38 +02:00
Ernest Zaslavsky
912c48a806 connection_factory: includes cleanup 2026-01-26 15:15:21 +02:00
Ernest Zaslavsky
3a31380b2c dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.
2026-01-26 15:15:15 +02:00
Ernest Zaslavsky
a05a4593a6 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.
2026-01-26 15:14:18 +02:00
Ernest Zaslavsky
6eb7dba352 connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.
2026-01-26 15:11:49 +02:00
Nadav Har'El
25ff4bec2a Merge 'Refactor streams' from Radosław Cybulski
Refactor streams.cc - turn `.then` calls into coroutines.

Reduces amount of clutter, lambdas and referenced variables.

Fixes #28342

Closes scylladb/scylladb#28185

* github.com:scylladb/scylladb:
  alternator: refactor streams.cc
  alternator: refactor streams.cc
2026-01-26 14:31:15 +02:00
Andrei Chekun
3d3fabf5fb test.py: change the name of the test in failed directory
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.

Closes scylladb/scylladb#28226
2026-01-26 13:29:45 +01:00
Łukasz Paszkowski
f06094aa95 topology_coordinator: add write_both_read_old_fallback_cleanup state
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.

Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.

Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.

This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.

By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
2026-01-26 13:14:37 +01:00
Łukasz Paszkowski
0a058e53c7 topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
b12f46babd topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.

In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).

Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.

This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
7c331b7319 topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
    if (action_failed(tablet_state.barriers[trinfo.stage])) {
        if (check_excluded_replicas()) {
            transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
            break;
        }
    }
    if (do_barrier()) {
        ...
    }
    break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.

On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.

Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.

Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
2026-01-26 13:14:36 +01:00
Marcin Maliszkiewicz
440590f823 auth: make find_record and callers standard_role_manager members
It will be usefull for following commit. Those methods are anyway class specific
2026-01-26 13:10:11 +01:00
Anna Stuchlik
84281f900f doc: remove the troubleshooting section on upgrades from OSS
This commit removes a document originally created to troubleshoot
upgrades from Open Source to Enterprise.

Since we no longer support Open Source, this document is now redundant.

Fixes https://github.com/scylladb/scylladb/issues/28246

Closes scylladb/scylladb#28248
2026-01-26 14:02:53 +02:00
Anna Stuchlik
c25b770342 doc: add the version name to the Install pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28022
It adds the version name to more install pages.

Closes scylladb/scylladb#28289
2026-01-26 13:11:14 +02:00
Alex
954d18903e db: view: extend try/catch scope in handle_create_view_local
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
2026-01-26 13:10:37 +02:00
Alex
2c3ab8490c db: view: refine create/drop coroutine signatures
Refactor the create/drop coroutine interfaces to accept parameters as
const references, enabling a clearer workflow and safer data flow.
2026-01-26 13:10:34 +02:00
Alex
114f88cb9b db: view: switch from continuations to coroutines
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
2026-01-26 13:08:24 +02:00
Alex
87c1c6f40f db: view: introduce helper to acquire or reuse semaphore units
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
2026-01-26 13:03:26 +02:00
Avi Kivity
ec70cea2a1 test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336
2026-01-26 12:27:19 +02:00
Pavel Emelyanov
77435206b9 code: Move limiting data source to test/lib
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:49:42 +03:00
Pavel Emelyanov
111b376d0d util: Simplify limiting_data_source API
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.

Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:37 +03:00
Pavel Emelyanov
e297ed0b88 util: Remove buffer_input_stream
It's now unused. All the users had been patched to use seastar memory
data source implementation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:10 +03:00
Pavel Emelyanov
4639681907 test: Use seastar::util::temporary_buffer_data_source in data consumer test
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.

This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:44:33 +03:00
Pavel Emelyanov
2399bb8995 sstables: Use seastar::util::as_input_stream() in mx reader
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.

Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:43:17 +03:00
Marcin Maliszkiewicz
6f32290756 audit: eliminate dynamic_cast to batch_statement in inspect
This is costly and not needed we can use a simple
bool flag for such check. It burns around 300 cpu
instructions on a hot request's path.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
a93ad3838f audit: cql: remove create_no_audit_info
We don't need a special guard value, it's
only being filled for batch statements for
which we can simply ignore the value.

Not having special value allows us to return
fast when audit is not enabled.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
02d59a0529 audit: add batch bool to audit_info class
In the following commit we'll use this field
instead of costly dynamic_cast when emitting
audit log.
2026-01-26 10:18:38 +01:00
Botond Dénes
57b2cd2c16 docs: inter-link incremental-repair and repair documents
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
2026-01-26 09:55:54 +02:00
Botond Dénes
a84b1b8b78 docs: incremental-repair: fix curl example
Currently it is regular text, make it a code block so it is easier to
read and copy+paste.
2026-01-26 09:55:54 +02:00
Pavel Emelyanov
1796997ace Merge 'restore: Enable direct download of fully contained SSTables' from Ernest Zaslavsky
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.

1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908

No need to backport as this code targets 2026.2 release (for tablet-aware restore)

Closes scylladb/scylladb#26834

* github.com:scylladb/scylladb:
  tests: reuse test_backup_broken_streaming
  streaming: enable direct download of contained sstables
  storage: add method to create component source
  streaming: keep sharded database reference on tablet_sstable_streamer
  streaming: skip streaming empty sstable sets
  streaming: inline streamer instance creation
  tests: fix incorrect backup/restore test flow
2026-01-26 10:22:34 +03:00
Ernest Zaslavsky
cb2aa85cf5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240
2026-01-26 10:19:57 +03:00
Avi Kivity
55422593a7 Merge 'test/lib: Fix bugs in boost_test_tree_lister.cc' from Dawid Mędrek
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:

* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
  listed in the output.

Refs scylladb/scylladb#25415

Backport: not needed. The code hasn't been used anywhere yet.

Closes scylladb/scylladb#28255

* github.com:scylladb/scylladb:
  test/lib/boost_test_tree_lister.cc: Record empty test suites
  test/lib/boost_test_tree_lister.cc: Deduplicate labels
2026-01-25 21:34:32 +02:00
Andrei Chekun
cc5ac75d73 test.py: remove deprecated skip_mode decorator
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.

Closes scylladb/scylladb#28299
2026-01-25 18:17:27 +02:00
Ernest Zaslavsky
66a33619da connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
2026-01-25 16:12:29 +02:00
Ernest Zaslavsky
5b3e513cba connection_factory: extract connection logic into a member
extract connection logic into a private member function to make it reusable
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
ce0c7b5896 connection_factory: remove unnecessary else 2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
359d0b7a3e connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
bd9d5ad75b s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call
2026-01-25 15:42:48 +02:00
Alex
1aadedc596 db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0 2026-01-25 14:29:09 +02:00
Ernest Zaslavsky
70f5bc1a50 tests: reuse test_backup_broken_streaming
reuse the `test_backup_broken_streaming` test to check for direct
sstable download
2026-01-25 13:27:44 +02:00
Ernest Zaslavsky
13fb605edb streaming: enable direct download of contained sstables
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
2026-01-25 13:27:44 +02:00
Yaron Kaikov
3dea15bc9d Update ScyllaDB version to: 2026.2.0-dev 2026-01-25 11:09:17 +02:00
Botond Dénes
edda66886e Merge 'Replace sstable_stream_source_impl's internal sink and source with seastar utilities' from Pavel Emelyanov
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.

Using newer seastar facilities, not backporting.

Closes scylladb/scylladb#28321

* github.com:scylladb/scylladb:
  sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
  sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
2026-01-23 16:55:18 +02:00
Pavel Emelyanov
3e09d3cc97 test: Keep test_gossiper_live_endpoints checks togethger
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28224
2026-01-23 16:53:48 +02:00
Piotr Dulikowski
3ec4f67407 Merge 'vector_index: Implement rescoring' from Szymon Malewski
This series implements rescoring algorithm.

Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.

When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.

Example:

Creating a Vector Index with Rescoring

```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
    id int PRIMARY KEY,
    embedding vector<float, 128>
);

-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
    USING 'vector_index'
    WITH OPTIONS = {
        'similarity_function': 'cosine',
        'quantization': 'i8',
        'oversampling': '2.0',
        'rescoring': 'true'
    };
```

1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results

Query example:

```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```

With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results

In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.

Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - doesn't need backport.

Closes scylladb/scylladb#27769

* github.com:scylladb/scylladb:
  vector_index: rescoring: Fetch oversampled rows
  vector_index: rescoring: Sort by similarity column
  select_statement: Modify `needs_post_query_ordering` condition
  vector_index: rescoring: Add hidden similarity score column
  vector_index: Refactor extracting ANN query information
2026-01-23 15:20:10 +01:00
Patryk Jędrzejczak
a41d7a9240 Merge 'test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Petr Gusev
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.

Fixes scylladb/scylladb#28260

backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version

Closes scylladb/scylladb#28315

* https://github.com/scylladb/scylladb:
  storage_proxy: drop stop() method
  test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
2026-01-23 15:18:17 +01:00
Avi Kivity
30d6f3b8e0 test: test_proxy_protocol: bump timeout
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.

The test never expects a timeout, so increasing it won't increase
the test duration.

Fixes #28028

Closes scylladb/scylladb#28272
2026-01-23 15:37:00 +02:00
Łukasz Paszkowski
09fde82a33 test/scylla_gdb: fix coro_task request usage, rename duplicate test
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
  and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
  to `test_sstable_index_cache` so both tests are collected

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#28286
2026-01-23 15:25:58 +02:00
Pavel Emelyanov
4a307d931a sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:22:22 +03:00
Pavel Emelyanov
97b1340a68 sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:21:14 +03:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Petr Gusev
c45244b235 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.
2026-01-23 11:22:03 +01:00
Petr Gusev
f5ed3e9fea test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260
2026-01-23 11:20:36 +01:00
Botond Dénes
35c9a00275 Merge 'test.py: pass correctly extra cmd line arguments' from Andrei Chekun
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.

No backport needed, only framework enhancement.

Closes scylladb/scylladb#28156

* github.com:scylladb/scylladb:
  test.py: do not crash when there is no boost log
  test.py: pass correctly extra cmd line arguments
2026-01-23 11:26:01 +02:00
Andrzej Jackowski
c493a66668 test: check cql_requests_count instead of tasks_processed in SL
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).

To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.

Fixes: scylladb/scylladb#27715

Closes scylladb/scylladb#28318
2026-01-23 10:19:29 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Anna Stuchlik
c681b3363d doc: remove an outdated KB
This page was created for very outdated versions of ScyllaDB (5.1 (ScyllaDB Open Source) and 2022.2 (ScyllaDB Enterprise) to ensure smooth upgrade to that versions.

We no longer need that document.

Fixes https://github.com/scylladb/scylladb/issues/28264

Closes scylladb/scylladb#28266
2026-01-22 18:27:14 +01:00
Szymon Wasik
927aebef37 Add vector search documentation links to CQL docs
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.

Fixes: SCYLLADB-371

Closes scylladb/scylladb#28312
2026-01-22 16:46:38 +01:00
Botond Dénes
f375288b58 tools/scylla-sstable: introduce filter command
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
2026-01-22 17:20:07 +02:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Michael Litvak
1f7a65904e alternator: don't require rf_rack flag for indexes, validate instead
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
2026-01-22 16:11:35 +01:00
Szymon Malewski
29d090845a vector_index: rescoring: Fetch oversampled rows
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.

With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-22 15:38:44 +01:00
Szymon Malewski
0bc95bcf87 vector_index: rescoring: Sort by similarity column
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.

However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
2026-01-22 15:38:44 +01:00
Szymon Malewski
57e7a4fa4f select_statement: Modify needs_post_query_ordering condition
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.

In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.

In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.

Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
2026-01-22 15:38:44 +01:00
Szymon Malewski
c89957b725 vector_index: rescoring: Add hidden similarity score column
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.

In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
2026-01-22 15:38:40 +01:00
Anna Stuchlik
0aa881f190 doc: add the info about Alternator ports to the Admin Guide
Fixes https://github.com/scylladb/scylladb/issues/23706

Closes scylladb/scylladb#27724
2026-01-22 16:10:58 +03:00
Israel Fruchter
6b3ce5fdcc dist: scylla_coredump_setup: force unmount /var/lib/systemd/coredump before setup
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.

Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.

Fix: scylladb/scylla-enterprise#5692

Closes scylladb/scylladb#28300
2026-01-22 14:35:26 +02:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Botond Dénes
21900c55eb tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.

Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.

Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
2026-01-22 13:55:59 +02:00
Botond Dénes
a1ed73820f tools/scylla-sstable: make partition_set ordered
Next patch will want partitions to be ordered. Remove unused
partition_map type.
2026-01-22 13:55:59 +02:00
Botond Dénes
d228e6eda6 tools/scylla-stable: remove unused boost/algorithm/string.hpp include 2026-01-22 13:55:59 +02:00
Piotr Smaron
d4c28690e1 db: fail reads and writes with local consistencty level to a DC with RF=0
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893
2026-01-22 12:49:45 +01:00
Piotr Smaron
9475659ae8 db: consistency_level: split local_quorum_for()
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
2026-01-22 12:49:23 +01:00
Piotr Smaron
0b3ee197b6 db: consistency_level: fix nrs -> nts abbreviation
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
2026-01-22 12:48:37 +01:00
Marcin Maliszkiewicz
32543625fc test: perf: reuse stream id
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
7bf7ff785a main: test: add future and abort_source to after_init_func
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
0d20300313 test: perf: add option to stress multiple tables in perf-cql-raw 2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
a033b70704 test: perf: add perf-cql-raw benchmarking tool
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull
to assess tool's overhead or use it as bigger scale benchmark
- uses prepared statements by default
- connection only mode, for testing storms

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
	mean=   99700.92 standard-deviation=3039.28
	median= 100409.60 median-absolute-deviation=2759.85
	maximum=102487.87 minimum=95707.93
instructions_per_op:
	mean=   109256.53 standard-deviation=2281.39
	median= 108337.37 median-absolute-deviation=1034.83
	maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
	mean=   76184.36 standard-deviation=2493.46
	median= 75320.20 median-absolute-deviation=922.09
	maximum=80572.19 minimum=74320.00
2026-01-22 12:26:50 +01:00
Patryk Jędrzejczak
e11450ccca test: test_raft_recovery_user_data: prevent repeated ALTER KEYSPACE request
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.

This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.

We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.

The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.

Fixes #27862

No backport needed as 2e7ba1f8ce is only
in master.

Closes scylladb/scylladb#28290
2026-01-22 14:13:42 +03:00
Nadav Har'El
9baaddb613 test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-22 10:43:59 +01:00
Szymon Malewski
e0cc6ca7e6 vector_index: Refactor extracting ANN query information
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
2026-01-22 10:00:47 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
d6557764b9 snapshot: keep per-sstable metadata in manifest.json
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.

Fixes: SCYLLADB-196

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:42:52 +02:00
Benny Halevy
dc9093303d snapshot: add table info and tablet_count to manifest.json
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.

For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables.  In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.

Fixes SCYLLADB-195

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:36:52 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
5e90fbb9d2 table: snapshot_on_all_shards: take snapshot_options
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
d9fc3b1c11 snapshot: add snapshot name in manifest.json
Store the snapshot tag in the manifest file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0f014e2916 test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0d82e56078 snapshot: add node info to manifest.json
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
24040efc54 snapshot: add manifest info to manifest.json
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
9e0f5410ae test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Patryk Jędrzejczak
e503340efc test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.

Fixes #28227

Closes scylladb/scylladb#28233
2026-01-22 06:55:16 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Nadav Har'El
5c1e525618 Merge 'vector_search: cache restrictions JSON at prepare time ' from Dawid Pawlik
Add `prepared_filter` class which handles the preparation, construction, and caching of Vector Search filtering compatible JSON object.

If no bind markers found in primary key restrictions, the JSON object will be built once at prepare time and cached for use during execution calls.

Additionally, this patch moves the filter functions from `cql3::restrictions` to `vector_search` namespace and does some renaming to make the purpose of those functions clear.

Follow-up: https://github.com/scylladb/scylladb/pull/28109
Fixes: [SCYLLADB-299](https://scylladb.atlassian.net/browse/SCYLLADB-299)

[SCYLLADB-299]: https://scylladb.atlassian.net/browse/SCYLLADB-299?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28276

* github.com:scylladb/scylladb:
  test/vector_search: add filter tests with bind variables
  vector_search: cache restrictions JSON at prepare time
  refactor: vector_search: move filter logic to vector_search namespace
2026-01-21 19:03:35 +02:00
Ernest Zaslavsky
51285785fa storage: add method to create component source
Extend the storage interface with a public method to create a
`data_source` for any sstable component.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
757e9d0f52 streaming: keep sharded database reference on tablet_sstable_streamer 2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
4ffa070715 streaming: skip streaming empty sstable sets
Avoid invoking streaming when the sstable set is empty. This prevents
unnecessary calls and simplifies the streaming logic.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
0fcd369ef2 streaming: inline streamer instance creation
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
32173ccfe1 tests: fix incorrect backup/restore test flow
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
2026-01-21 16:40:12 +02:00
Patryk Jędrzejczak
1f0f694c9e test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274
2026-01-21 15:17:42 +01:00
Petr Gusev
843da2bbb8 test_encryption: capture stderr
The check_output function doesn't capture stderr by default, making
it impossible to debug if something goes wrong.
2026-01-21 14:56:01 +01:00
Petr Gusev
925d86fefc test/cluster: add test_strong_consistency.py
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.

Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
2026-01-21 14:56:01 +01:00
Petr Gusev
59a876cebb raft_group_registry: disable metrics for non-0 groups
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.

This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
2026-01-21 14:56:01 +01:00
Petr Gusev
1f170d2566 strong consistency: implement select_statement::do_execute() 2026-01-21 14:56:01 +01:00
Petr Gusev
a5d611866e cql: add select_statement.cc 2026-01-21 14:56:01 +01:00
Petr Gusev
493ebe2add strong consistency: implement coordinator::query() 2026-01-21 14:56:01 +01:00
Petr Gusev
ccf90cfde8 cql: add modification_statement
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
2026-01-21 14:56:01 +01:00
Petr Gusev
989566e8a3 cql: add statement_helpers
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.

redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.

is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
2026-01-21 14:56:01 +01:00
Petr Gusev
b94fd11939 strong consistency: implement coordinator::mutate()
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().

Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.

The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.

We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
2026-01-21 14:56:01 +01:00
Petr Gusev
801bf82a34 raft.hh: make server::wait_for_leader() public
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.

We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
2026-01-21 14:56:01 +01:00
Petr Gusev
7d111f2396 strong_consistency: add coordinator
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
2026-01-21 14:56:01 +01:00
Petr Gusev
4413142f25 modification_statement: make get_timeout public
We'll need to access this method in a new
strong_consistency/modification_statement class.
2026-01-21 14:56:00 +01:00
Petr Gusev
4902186ede strong_consistency: add groups_manager
This class is reponsible for managing raft groups for
strongly-consistent tablets.
2026-01-21 14:56:00 +01:00
Petr Gusev
4eee5bc273 strong_consistency: add state_machine and raft_command
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.

We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
2026-01-21 14:56:00 +01:00
Petr Gusev
a8350b274e table: add get_max_timestamp_for_tablet
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.

This commit adds a function that returns the maximum timestamp for a
given tablet.

Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
  timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
  the maximum of the contributing sstables.
2026-01-21 14:56:00 +01:00
Petr Gusev
dc4896b69b tablets: generate raft group_id-s for new table
We assign shard to zero because raft is currently only supported on the
zero shard.
2026-01-21 14:56:00 +01:00
Petr Gusev
6dd74be684 tablet_replication_strategy: add consistency field
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
2026-01-21 14:56:00 +01:00
Petr Gusev
53f93eb830 tablets: add raft_group_id
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.

This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.

The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
2026-01-21 14:56:00 +01:00
Petr Gusev
cab3e1eea5 modification_statement: remove virtual where it's not needed
This is a refactoring/simplification commit.
2026-01-21 14:56:00 +01:00
Petr Gusev
9015bed794 modification_statement: inline prepare_statement()
This is a refactoring/simplification commit.

There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
2026-01-21 14:56:00 +01:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Petr Gusev
6b0d757f28 cql: rename strongly_consistent statements to broadcast statements
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.

The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
2026-01-21 14:56:00 +01:00
Pavel Emelyanov
18b5a49b0c Populate all sl:* groups into dedicated top-level supergroup
This patch changes the layout of user-facing scheduling groups from

/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)

into

/
`- user (supergroup)
   `- statement
   `- sl:default
   `- sl:*
`- other groups (compaction, streaming, etc.)

The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).

The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.

The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.

Unit tests keep their user groups under root supergroup (for simplicity)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28235
2026-01-21 14:14:48 +02:00
Botond Dénes
cf4133c62d Merge 'service: pass topology guard to RBNO' from Aleksandra Martyniuk
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.

Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759

No topology_guard in any version; needs backport to all versions

Closes scylladb/scylladb#27839

* github.com:scylladb/scylladb:
  service: use session variable for streaming
  service: pass topology guard to RBNO
2026-01-21 12:47:28 +02:00
Robert Bindar
ea8a661119 reduce test_backup.py and test_refresh.py datasets
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).

This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28280
2026-01-21 10:47:36 +02:00
Nadav Har'El
8962093d90 Merge 'vector_index: Introduce rescoring index option' from Szymon Malewski
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.

Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294

No backporting - it is a new feature.

Closes scylladb/scylladb#28165

* github.com:scylladb/scylladb:
  vector_search: Add more rescoring validation tests
  vector_search: Add rescoring validation test
  vector_search: doc: Document new index option
  vector_search: test: Add `rescoring` index option test
  vector_index: introduce rescoring option
  vector_index: improve options validation
2026-01-21 10:46:22 +02:00
Avi Kivity
ecb6fb00f0 streamed_mutation_freezer: use chunked_vector instead of std::deque for clustering rows
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.

Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.

Fixes #28275

Closes scylladb/scylladb#28281
2026-01-21 10:13:44 +02:00
Botond Dénes
09d3b6c98b Merge 'auth: use paged internal queries during migration' from Piotr Smaron
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: https://github.com/scylladb/scylladb/issues/27577

A minor enhancement, no need to backport.

Closes scylladb/scylladb#25395

* github.com:scylladb/scylladb:
  auth: use paged internal queries during migration
  auth: move some code in migrate_to_auth_v2 up
  auth: re-align pieces of migrate_to_auth_v2
  cql: extend `query_internal` with `query_state` param
2026-01-21 09:58:06 +02:00
Raphael S. Carvalho
d16f9c821d Revert "api: storage_service/tablets/repair: disable incremental repair by default"
This reverts commit c8cff94a5a.

Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28120
2026-01-21 08:50:13 +02:00
Dawid Pawlik
4a7e20953a test/cqlpy: remove the xfail mark from already passing tests
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.

Closes scylladb/scylladb#28231
2026-01-21 08:47:20 +02:00
Botond Dénes
7a3f51b304 Update seastar submodule
* seastar e00f1513...f55dc7eb (8):
  > build: detect io_uring correctly
  > ioinfo: report actual I/O latency goal instead of configured value
  > core/loop: repeater,repeat_until_value: don't rethrow exceptions
  > seastar::test: Add exception interception to extract nested info
  > bool_class: add operator<=>
  > Merge 'Perftune.py: generic virtual device with lower-level slave network devices support' from Vladislav Zolotarov
    scripts/perftune.py: NetPerfTuner: rename "hardware interface" methods to "tunable interface" methods
    scripts/perftune.py: NetPerfTuner: refactor NIC handling to support any virtual interface with slaves
    scripts/perftune.py: NetPerfTuner: rename methods to private for better encapsulation
  > scripts/perftune.py: add fast path queues IRQ filtering and indexing for Microsoft Azure Network Adapter (MANA) NICs
  > file: Finally override dma_read_bulk() in posix_file_impl

Closes scylladb/scylladb#28262
2026-01-21 08:44:20 +02:00
Szymon Malewski
d6226500f6 vector_search: Add more rescoring validation tests
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.

As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
2026-01-20 21:01:45 +01:00
Karol Nowacki
376c70be75 vector_search: Add rescoring validation test
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.

First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
2026-01-20 21:01:45 +01:00
Karol Nowacki
41dcf12463 vector_search: doc: Document new index option
Adds documentation for the rescoring option for vector search indexes.
2026-01-20 21:01:45 +01:00
Karol Nowacki
b268eda67e vector_search: test: Add rescoring index option test
Add tests to validate `rescoring` index options.
It also improves tests for related `oversampling` option validation.
2026-01-20 21:01:45 +01:00
Szymon Malewski
c5945b1ef4 vector_index: introduce rescoring option
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
2026-01-20 21:01:45 +01:00
Szymon Malewski
262a8cef0b vector_index: improve options validation
In this patch we enhance validation of option by:
- giving context (option name) in error messages
- listing supported values in error messages of enumerated options
- avoiding using templates

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Follow-up: https://github.com/scylladb/scylladb/pull/27677
2026-01-20 21:01:41 +01:00
Dawid Pawlik
f27ef79d0d test/vector_search: add filter tests with bind variables
Add tests that check if preparation of the filter  does work
and not produce cache when the restrictions consist of bind variables.
2026-01-20 18:17:46 +01:00
Dawid Pawlik
e62cb29b7d vector_search: cache restrictions JSON at prepare time
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.

If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.

Adjust tests accordingly to use prepared filters.

Follow-up: #28109
Fixes: SCYLLADB-299
2026-01-20 17:15:52 +01:00
Michał Jadwiszczak
f89a8c4ec4 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.

This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.

Fixes scylladb/scylladb#28183
2026-01-20 15:58:08 +01:00
Andrei Chekun
91e2b027ce test.py: do not crash when there is no boost log
Small fix to not crash the whole process when boost tests failed to
start and do not produced the log file that can be parsed.
2026-01-20 15:52:40 +01:00
Andrei Chekun
58d3052ad4 test.py: pass correctly extra cmd line arguments
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
2026-01-20 15:52:40 +01:00
Anna Stuchlik
84e9b94503 doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725
2026-01-20 15:41:03 +01:00
Dawid Pawlik
f54a4010c0 refactor: vector_search: move filter logic to vector_search namespace
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.

Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.

Follow-up: #28109
2026-01-20 13:13:43 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Avi Kivity
c7dda5500c database: simplify apply_counter_update exception handling
Use coroutine::try_future to exit the coroutine immediately on
error instead of explict checks.

Closes scylladb/scylladb#28257
2026-01-20 11:13:49 +02:00
Wojciech Mitros
fc2aecea69 idl: don't redefine bound_weight and partition_region in paging_state.idl.hh
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.

Closes scylladb/scylladb#28228
2026-01-20 11:12:47 +02:00
Aleksandra Martyniuk
2be5ee9f9d service: use session variable for streaming
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
3fe596d556 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
f0dbf6135d test: add test for enforce_rack_list option 2026-01-20 10:01:15 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Michael Litvak
e7ec87382e Revert "alternator: require rf_rack_valid_keyspaces when creating index"
This reverts commit 4b26a86cb0.

The rf_rack_valid_keyspaces option is now not required for creating MVs.
2026-01-20 09:56:48 +01:00
Pavel Emelyanov
8ecd4d73ac test: Update cluster/object_store/ tests to use new S3 config format
Currently the suite generates config in old format, and only a single
test validates that using new format "works".

This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28113
2026-01-20 10:53:34 +02:00
Pavel Emelyanov
5e369c0439 audit: Stop using deprecated seastar UDP sending API
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28198
2026-01-20 10:51:23 +02:00
Avi Kivity
874322f95e multishard_query: simplify do_query() coroutine/continuation complexity
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.

While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.

Closes scylladb/scylladb#28258
2026-01-20 10:45:10 +02:00
Łukasz Paszkowski
e07fe2536e test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
   node A is missing data, digest mismatches and data reconciliation
   is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
   failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140
2026-01-20 10:38:20 +02:00
Piotr Smaron
6084f250ae auth: use paged internal queries during migration
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: scylladb/scylladb#27577
2026-01-20 09:32:21 +01:00
Piotr Smaron
345458c2d8 auth: move some code in migrate_to_auth_v2 up
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
2026-01-20 09:15:53 +01:00
Piotr Smaron
97fe3f2a2c auth: re-align pieces of migrate_to_auth_v2
This is needed to make next commits more readable.
Just moved the touched code 4 spaces to the right.
2026-01-20 09:15:53 +01:00
Piotr Smaron
3ca6b59f80 cql: extend query_internal with query_state param
This later is going to be used to pass a query timeout via `qs` to
`query_internal`.
2026-01-20 09:15:48 +01:00
Nadav Har'El
70b3cd0540 Merge 'vector_index: introduce quantization and oversampling options' from Szymon Malewski
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - no backporting

Closes scylladb/scylladb#27677

* github.com:scylladb/scylladb:
  vector_search: doc: Document new index options
  vector_search: test: Test oversampling
  vector_search: test: Add rescoring index options test
  vector_search: test: Extract Configure utility to shared header
  vector_index: introduce `quantization` and `oversampling` options
2026-01-20 08:50:46 +02:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Dawid Mędrek
3b8bf85fbc test/lib/boost_test_tree_lister.cc: Record empty test suites
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.

However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.

We start including those empty suites and files in the output of
`--list_json_content`.

---

Examples (with additional formatting):

* Consider the following test file, `test/boost/dummy_test.cc` [1]:

  ```
  BOOST_AUTO_TEST_SUITE(dummy_suite1)
  BOOST_AUTO_TEST_SUITE(dummy_suite2)
  BOOST_AUTO_TEST_SUITE_END()
  BOOST_AUTO_TEST_SUITE_END()

  BOOST_AUTO_TEST_SUITE(dummy_suite3)
  BOOST_AUTO_TEST_SUITE_END()
  ```

  Before this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file":"test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites": [
       {"name": "dummy_suite2", "suites": [], "tests": []}
    ], "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}
  ], "tests": []}}]
  ```

* Consider the same test file as in Example 1, but also assume it's compiled
  into `test/boost/combined_tests`.

  Before this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
  $
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content
  [..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites":
      [{"name": "dummy_suite2", "suites": [], "tests": []}],
    "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}],
  "tests":[]}}, ...]
  ```

[1] Note that the example is simplified. As of now, it's not possible to use
    `--list_json_content` with a file without any Boost tests. That will
    result in the following error: `Test setup error: test tree is empty`.

Refs scylladb/scylladb#25415
2026-01-19 18:03:24 +01:00
Dawid Mędrek
1129599df8 test/lib/boost_test_tree_lister.cc: Deduplicate labels
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.

Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.

---

Example (with additional formatting):

Consider the following test in the file `test/boost/dummy_test.cc`:

```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
    return make_ready_future();
}
```

Before this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```

After this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```

Refs scylladb/scylladb#25415
2026-01-19 18:01:14 +01:00
Nikos Dragazis
4cde34f6f2 storage_service: Remove redundant yields
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-19 16:18:49 +02:00
Marcin Maliszkiewicz
1318ff5a0d test: perf: move cut_arg helper func to common code
It will be reused later.
2026-01-19 14:33:10 +01:00
Tomasz Grabiec
7977c97694 Merge 'db: add effective_capacity to load_per_node virtual table' from Ferenc Szili
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.

Size based load balancing is currently only on master, so no backport is needed.

Closes scylladb/scylladb#28220

* github.com:scylladb/scylladb:
  docs: add effective_capacity to system keyspace docs
  virtual_table: add effective_capacity to load_per_node
2026-01-19 13:17:28 +01:00
Marcin Maliszkiewicz
be8a30230b Merge 'test/cluster/dtest: import scrub_test.py' from Botond Dénes
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.

dtest import, no backport needed

Closes scylladb/scylladb#28085

* github.com:scylladb/scylladb:
  test/cluster/dtest: remove is_win() and users
  test/cluster/dtest/scrub_test.py: add license blurb
  test/cluster/dtest: import scrub_test.py
  test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
  test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
2026-01-19 12:14:08 +01:00
Botond Dénes
2e4d0e42f0 test/cluster/dtest: remove is_win() and users
ScyllaDB and its tests never run on windows, this function is not
needed, patch it out.
2026-01-19 12:56:57 +02:00
Botond Dénes
8953a143e5 test/cluster/dtest/scrub_test.py: add license blurb
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
2026-01-19 12:55:59 +02:00
Botond Dénes
d2c266eb47 test/cluster/dtest: import scrub_test.py
Import the test verbatim. Requires adding is_win() to ccmlib/common.py,
with a dummy implementation.
2026-01-19 12:52:44 +02:00
Botond Dénes
99e8a92aef test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
To work in the local test.py context.
2026-01-19 12:52:44 +02:00
Botond Dénes
807da53583 test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
2026-01-19 12:52:44 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Karol Nowacki
324b829263 vector_search: doc: Document new index options
Adds documentation for the `quantization` and `oversampling`
options for vector search indexes.
2026-01-19 10:28:46 +01:00
Karol Nowacki
bca17290f4 vector_search: test: Test oversampling
Add test to verify that Scylla correctly oversamples the limit
according to the oversampling option.
2026-01-19 10:28:46 +01:00
Karol Nowacki
e347f6d0d4 vector_search: test: Add rescoring index options test
Add tests to validate quantization and oversampling index options.
2026-01-19 10:28:44 +01:00
Karol Nowacki
24b037e8e3 vector_search: test: Extract Configure utility to shared header
Move Configure test utility to dedicated file for reuse across test suites.
2026-01-19 10:21:44 +01:00
Szymon Malewski
b8e91ee6ae vector_index: introduce quantization and oversampling options
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-19 10:21:43 +01:00
Tomasz Grabiec
dd0fc35c63 lsa: Export metrics for reclaim/evict/compact time
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.

Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.

reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.

evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.

Closes scylladb/scylladb#27774
2026-01-19 12:08:16 +03:00
Nadav Har'El
3e270a49f7 test/cqlpy: remove test_describe.py from cluster reuse blacklist
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.

I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.

We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
   CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
   if "USE" was in effect, it will return the correct results.

I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.

Fixes #26291

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27973
2026-01-19 12:02:00 +03:00
Dario Mirovic
823d1b9c03 audit: fix start_audit init sequence placement
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.

Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.

Fixes SCYLLADB-252

Closes scylladb/scylladb#28139
2026-01-19 11:57:39 +03:00
Botond Dénes
6f5f42305a docs: make the glossary more tablet inclusive
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.

One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.

Closes scylladb/scylladb#28170
2026-01-19 11:50:13 +03:00
Ernest Zaslavsky
829bd9b598 aws_error: fix nested exception handling
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary

Closes scylladb/scylladb#28143
2026-01-19 11:41:47 +03:00
Botond Dénes
b7bc48e7b7 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155
2026-01-19 11:37:51 +03:00
Nadav Har'El
d86d5b33aa test/cqlpy: translate Cassandra's unit tests for LWT
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.

This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.

Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.

These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.

However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.

Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.

Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:

1. On a successful conditional write, Cassandra returns just true, Scylla
   also returns the old contents of the row.

2. Similarly, in an IF EXISTS write that failed (the row did not exist),
   Cassandra returns just false, Scylla also returns extra null values for
   each and every column of the row.

3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
   UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.

4. When there are static columns, Scylla's LWT response returns the static
   column first, Cassandra returns the modified column first. Since both
   also say which columns they return, neither is more correct than the other,
   a normally users will address specific columns by name, not by position.

5. docs/kb/lwt-differences.rst explains that "the returned result set
   contains an old row for every conditional statement in the batch".
   Beyond this different, actually non-conditional updates in the batch will
   also get a row in Scylla's result. Refs #27955.

6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
   and other conditions for the same row. Cassandra doesn't, so checks that
   these combinations are not allowed were commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27961
2026-01-19 09:46:04 +02:00
Botond Dénes
19efd7f6f9 Merge 'The system_replicated_keys should be mark as a system keyspace' from Amnon Heiman
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.

A side effect of that is that metrics that are not supposed to be reported are.
Fixes #27903

Closes scylladb/scylladb#27954

* github.com:scylladb/scylladb:
  distributed_loader: system_replicated_keys as system keyspace
  replicated_key_provider: make KSNAME public
2026-01-19 09:37:41 +02:00
Aleksandra Martyniuk
65cba0c3e7 service: node_ops: remove coroutine::lambda wrappers
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.

The lambda captures may be destroyed before the lambda is called, leading
to use after free.

Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.

Fixes: https://github.com/scylladb/scylladb/issues/28200.

Closes scylladb/scylladb#28201
2026-01-19 09:19:53 +02:00
Botond Dénes
c8811387e1 Merge 'service: do not change the schema while pausing the rf change ' from Aleksandra Martyniuk
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).

Fixes: https://github.com/scylladb/scylladb/issues/28167

No backport needed; changes that introduced a bug are only on master

Closes scylladb/scylladb#28168

* github.com:scylladb/scylladb:
  service: fin indentation
  test: add test_numeric_rf_to_rack_list_conversion_abort
  service: tasks: fix type of global_topology_request_virtual_task
  service: do not change the schema while pausing the rf change
2026-01-19 09:15:20 +02:00
Botond Dénes
7d637b14e8 erge 'test/cluster/test_internode_compression: Transpose test from dtest' from Calle Wilund
Refs #27429

Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.

Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.

Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.

Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.

When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...

Closes scylladb/scylladb#28133

* github.com:scylladb/scylladb:
  test/cluster/test_internode_compression: Transpose test from dtest
  gossiper/main: Extend special treatment of node ID resolve for rpc_address
2026-01-19 08:34:31 +02:00
Ferenc Szili
1136a3f398 docs: add effective_capacity to system keyspace docs
This adds the description of effective_capacity to the documentation
of the system keyspace.
2026-01-18 16:57:08 +01:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Nadav Har'El
3e138a2685 test/cqlpy: Add our copyright/license to translated Cassandra tests
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.

However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.

So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:

   # Modifications: Copyright 2026-present ScyllaDB
   # SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

with the appropriate year instead of 2026.

Fixes #28215

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28216
2026-01-18 16:25:28 +01:00
Tomasz Grabiec
3b0df29ceb docs: Document parallel decommission and removenode and relevant task API 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
85140cdf7e test: Add tests for parallel decommission/removenode 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
5c93e12373 test: util: Introduce ensure_group0_leader_on()
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.

And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.

It's much easier to ensure the leader, than to prepare tests to
handle failovers.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
478b8f09df test: tablets: Check that there are no migrations scheduled on draining nodes
In case of decommission, it's not desirable because it's less urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
e082e32cc7 test: lib: topology_builder: Introduce add_draining_request() 2026-01-18 15:36:07 +01:00
Tomasz Grabiec
baea12c9cb topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
1b784e98f3 tablets: topology_coordinator: Refactor to propagate reason for migration rollback
Will be easier to implement later change to cancel topology request,
where we need to give a reason for doing so.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
2d954f4b19 tablet_allocator: Skip co-location on draining nodes
In case of decommission, it's not desirable because it's less
urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
d9e1a6006f node_ops: task_manager_module: Populate entity field also for active requests 2026-01-18 15:36:06 +01:00
Tomasz Grabiec
bbd293d440 tasks: node_ops: Put node id in the entity field
If we have many node requests active at a time, it's useful to know
which requets works on which node.

Fixes #27208
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
576ebcdd30 tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
They should return the same, so extract the common logic.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
629d6d98fa topology: Protect against empty cancelation reason
Request would be deemed successful, which is counter to the intention
of cancelation and effect on the system.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
7446eb7e8d tasks, topology: Make pending node operations abortable
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
091ed4d54b doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
Since 288e75fe22
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
a009644c7d raft_topology, tablets: Drain tablets in parallel with other topology operations
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
1c2e47e059 locator: topology: Add "draining" flag to a node
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a37b1ce832 topology_coordinator: Extract generate_cancel_request_update() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
77bd00bf9f storage_service: Drop dependency in topology_state_machine.hh in the header
To reduce compilation time.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a24c3fc229 locator: Extract common code in assert_rf_rack_valid_keyspace() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
d3ee82ea51 topology_coordinator, storage_service: Validate node removal/decommission at request submission time
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.

This trips tests which expect validation failure, but get tablet
draining failure instead.

Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.

So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.

validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.

Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
2026-01-18 15:36:04 +01:00
Nadav Har'El
34d28475d9 Merge 'Implement Vector Search filtering API' from Dawid Pawlik
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.

After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.

---

This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.

Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.

---

Fixes: SCYLLADB-249

New feature, should land into 2026.1

Closes scylladb/scylladb#28109

* github.com:scylladb/scylladb:
  docs: update documentation on filtering with vector queries
  test/vector_search: add test for filtered ANN with VS mock
  test/vector_search: add restriction to JSON conversion unit tests
  vector_search: cql: construct and use filter in ANN vector queries
  select_statement: do not require post query ordering for vector queries
  vector_search: add `statement_restrictions` to JSON parsing
2026-01-18 16:11:29 +02:00
Ernest Zaslavsky
eb76858369 Update seastar submodule
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```

Closes scylladb/scylladb#28147
2026-01-18 15:00:48 +02:00
Andrzej Jackowski
6eca7e4ff6 transport: unify lambda capture lifetime for control connections
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.

The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
 - update_user_scheduling_group_v1 and update_user_scheduling_group_v2
   use the deducing this (this auto self) to ensure the proper
   lifecycle of the lambda capture.
 - update_control_connection_scheduling_group doesn’t use the deducing
   this, but the lambda captures only `this`, which is used before
   the first possible coroutine preemption. Therefore, it doesn’t seem
   that any memory corruption or undefined behavior is possible here.

Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.

Fixes: SCYLLADB-284

Closes scylladb/scylladb#28158
2026-01-17 20:36:31 +02:00
Nikos Dragazis
8aca7b0eb9 test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197
2026-01-17 20:32:06 +02:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Ferenc Szili
0aebc17c4c docs: correct spelling errors in size based balancing docs
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.

Closes scylladb/scylladb#28196
2026-01-16 17:41:57 +02:00
Aleksandra Martyniuk
ad2381923f service: fin indentation 2026-01-16 11:38:10 +01:00
Aleksandra Martyniuk
504290902c test: add test_numeric_rf_to_rack_list_conversion_abort
Add regression test that checks whether aborted rf change leaves
the system_schema.keyspaces unchanged.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
3ed8701301 service: tasks: fix type of global_topology_request_virtual_task
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).

Retrieve the type from the variant before passing it to task type.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
580dfd63e5 service: do not change the schema while pausing the rf change
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
2026-01-16 11:36:15 +01:00
Patryk Jędrzejczak
eb7be9010d Merge 'topology_coordinator: Refresh load stats after table is created or altered' from Tomasz Grabiec
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes https://github.com/scylladb/scylladb/issues/27921

No backport becuse only master is vulnerable (size-based load balancing).

Closes scylladb/scylladb#27926

* https://github.com/scylladb/scylladb:
  test: cluster: Add reproducer for missed notification in topology coordinator
  topology_coordinator: Wake up the state machine after stats refresh
  topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
  topology_coordinator: Fix potential missed notification
  topology_coordinator: Refresh load stats after table is created or altered
  tablets: Do a group0 read barrier on tablet load stats refresh
  topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
  test: Use ManagerClient.{disable,enable}_tablet_balancing()
  test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
  test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
2026-01-16 11:34:57 +01:00
Dawid Pawlik
383f9e6e56 docs: update documentation on filtering with vector queries
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
67d3454d2b test/vector_search: add test for filtered ANN with VS mock
Implement a test using Vector Store mock to check if end-to-end
integration works with filtered ANN query.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a54be82536 test/vector_search: add restriction to JSON conversion unit tests
Add unit tests for conversion of CQL restrictions to Vector Store filtering API
compatible JSON objects. The tests include:
- empty restriction
- `ALLOW FILTERING` in restriction
- single column restrictions
    - `=`, `<`, `>`, `<=`, `>=`, `IN`
- multiple column restrictions
    - `()=()`, `()<()`, `()>()`, `()<=()`, `()>=()`, `() IN ()`
- multiple restrictions conjunction
- `TEXT` and `BOOLEAN` column restrictions
2026-01-16 11:18:23 +01:00
Dawid Pawlik
2a38794b8e vector_search: cql: construct and use filter in ANN vector queries
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.

Adjust existing `vector_store_client_test` tests accordingly.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
304e908e3b select_statement: do not require post query ordering for vector queries
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.

If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a84d1361db vector_search: add statement_restrictions to JSON parsing
Add a module parsing the statement restrictions into Vector Store
filtering API compatible JSON objects.

The API was defined in: scylladb/vector-store#334

Examplary JSON object compatible with the API:
```
{
 "restrictions": [
     { "type": "==", "lhs": "pk", "rhs": 1 },
     { "type": "IN", "lhs": "pk", "rhs": [2, 3] },
     { "type": "<", "lhs": "ck", "rhs": 4 },
     { "type": "<=", "lhs": "ck", "rhs": 5 },
     { "type": ">", "lhs": "pk", "rhs": 6 },
     { "type": ">=", "lhs": "pk", "rhs": 7 },
     { "type": "()==()", "lhs": ["pk", "ck"], "rhs": [10, 20] },
     { "type": "()IN()", "lhs": ["pk", "ck"], "rhs": [[100, 200], [300, 400]] },
     { "type": "()<()", "lhs": ["pk", "ck"], "rhs": [30, 40] },
     { "type": "()<=()", "lhs": ["pk", "ck"], "rhs": [50, 60] },
     { "type": "()>()", "lhs": ["pk", "ck"], "rhs": [70, 80] },
     { "type": "()>=()", "lhs": ["pk", "ck"], "rhs": [90, 0] }
 ],
 "allow_filtering": true
}
```
2026-01-16 11:18:23 +01:00
Tomasz Grabiec
3fb7719277 topology_coordinator: Update load stats in case rebuilding with no live replica
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.

This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.

Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.

Fixes #28115

Closes scylladb/scylladb#28121
2026-01-16 11:19:01 +02:00
Sergey Zolotukhin
799d837295 test: disable test_start_bootstrapped_with_invalid_seed
The test intermittently fails when an invalid DNS name is resolved,
likely due to ISP DNS error hijacking (see scylladb/scylladb#28153).

Disable this test to unblock CI.

Fixes scylladb/scylladb#28153

Closes scylladb/scylladb#28162
2026-01-15 10:25:45 +01:00
Jenkins Promoter
51d61f809e Update pgo profiles - aarch64 2026-01-15 05:13:03 +02:00
Jenkins Promoter
eed1e7fa23 Update pgo profiles - x86_64 2026-01-15 04:33:43 +02:00
Tomasz Grabiec
eef798d84f Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.

This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.

Fixes #27281

Closes scylladb/scylladb#27397

* github.com:scylladb/scylladb:
  test: reduce dataset and number of test cases or debug builds
  test: bump repair timeout up, it's sometimes not enough in CI
  test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
  test: extend test_restore_with_streaming_scopes
  test: Adjust test_restore_primary_replica_different_dc_scope_all
  test: Refactor restoring code in test_backup to match SM pattern
  test: add check_mutation_replicas calls after fresh creation of dataset
  test: extend create_dataset to accept consistency_level
  test: refactor check_mutation_replicas so it's more readable
  test: make create_dataset async and refactor so it's configurable
  test: use defaultdict in collect_mutations
  test: add log marks to facilitate reusing server for restore
  locator: tablets: Distribute data evenly among primary replicas during restore
2026-01-14 18:57:55 +01:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Yaniv Michael Kaul
d919aacc69 storage_proxy: mark write_timeouts metric for counter write timeouts
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.

Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.

This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28019
2026-01-14 17:50:46 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Botond Dénes
551eecab63 Merge 'EAR: deprecate the replicated key provider' from Calle Wilund
Refs #22733.

Adds runtime warning and docs info that replicated provider is deprecated and will be removed.

Fixes #27292

Closes scylladb/scylladb#27270

* github.com:scylladb/scylladb:
  docs::encryption: Add warning that replicated provider is deprecated
  ent::encryption: Switch default key provider from replicated to local
  replicated_key_provider: Add deprecation warning on usage
2026-01-14 13:47:23 +02:00
Calle Wilund
2fd6ca4c46 test/cluster/test_internode_compression: Transpose test from dtest
Refs #27429

re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.

v2:
* Included positive test (mode=all)
2026-01-14 10:53:34 +01:00
Patryk Jędrzejczak
6b5923c64e test: test_group0_schema_versioning: wait for schema sync in system.local
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.

The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.

We fix the problem by making the function wait until the schema versions
match.

Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.

Fixes #23803

Closes scylladb/scylladb#28114
2026-01-14 09:55:45 +01:00
Jakub Smolar
aefd815194 test.py: add pexpect to the dependencies
Use pexpect to control a presistent GDB process with pattern reads and timeouts. This makes 'scylla_gdb' tests faster and less flaky.
Added python3-pexpect in 'install-dependencies.sh'.

Closes scylladb/scylladb#26419

[avi:
  build optimized clang 21.1.8
  regenerated frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28134
2026-01-14 10:17:37 +02:00
Botond Dénes
122b7847e5 Merge 'index: Accept view properties in CREATE INDEX' from Dawid Mędrek
Problem
-------
Secondary indexes are implemented via materialized views under the
hood. The way an index behaves is determined by the configuration
of the view. Currently, it can be modified by performing the CQL
statement `ALTER MATERIALIZED VIEW` on it. However, that raises some
concerns.

Consider, for instance, the following scenario:

1. The user creates a secondary index on a table.
2. In parallel, the user performs writes to the base table.
3. The user modifies the underlying materialized view, e.g. by setting
   the `synchronous_updates` to `true` [1].

Some of the writes that happened before step 3 used the default value
of the property (which is `false`). That had an actual consequence
on what happened later on: the view updates were performed
asynchronously. Only after step 3 had finished did it change.

Unfortunately, as of now, there is no way to avoid a situation like
that. Whenever the user wants to configure a secondary index they're
creating, they need to do it in another schema change. Since it's
not always possible to control how the database is manipulated in
the meantime, it leads to problems like the one described.

That's not all, though. The fact that it's not possible to configure
secondary indexes is inconsistent with other schema entities. When
it comes to tables or materialized views, the user always have a means
to set some or even all of the properties during their creation.

Solution
--------
The solution to this problem is extending the `CREATE INDEX` CQL
statement by view properties. The syntax is of form:

```
> CREATE INDEX <index name>
> .. ON <keyspace>.<table> (<columns>)
> .. WITH <properties>
```

where `<properties>` corresponds to both index-specific and view
properties [2, 3]. View properties can only be used with indexes
implemented with materialized views; for example, it will be impossible
to create a vector index when specifying any view property (see
examples below).

When a view property is provided, it will be applied when creating the
underlying materialized view. The behavior should be similar to how
other CQL statements responsible for creating schema entities work.

High-level implementation strategy
----------------------------------
1. Make auxiliary changes.
2. Introduce data structures representing the new set of index
   properties: both index-specific and those corresponding to the
   underlying view.
3. Extend `CREATE INDEX` to accept view properties.
4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include
   view properties in their output.

User documentation is also updated at the steps to reflect the
corresponding changes.

Implementation considerations
-----------------------------
There are a number of schema properties that are now obsolete. They're
accepted by other CQL statements, but they have no effect. They
include:

* `index_interval`
* `replicate_on_write`
* `populate_io_cache_on_flush`
* `read_repair_chance`
* `dclocal_read_repair_chance`

If the user tries to create a secondary index specifying any of those
keywords, the statement will fail with an appropriate error (see
examples below).

Unlike materialized views, we forbid specifying the clustering order
when creating a secondary index [4]. This limitation may be lifted
later on, but it's a detail that may or may not prove troublesome. It's
better to postpone covering it to when we have a better perspective on
the consequences it would bring.

Examples
--------
Good examples
```
> CREATE INDEX idx ON ks.t (v);
> CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property';
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'multiple view properties are ok'
  .. AND synchronous_updates = true;
> CREATE INDEX idx ON ks.t (v)
  .. WITH comment = 'default value ok'
  .. AND synchronous_updates = false;
```

Bad examples
```
> CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true;

SyntaxException: Unknown property 'replicate_on_write'

> CREATE INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Cannot specify options for a non-CUSTOM index"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="CUSTOM index requires specifying the index class"

> CREATE CUSTOM INDEX idx ON ks.t (v)
  .. USING 'vector_index'
  .. WITH OPTIONS = {'option1': 'value1'}
  .. AND comment = 'some text';

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="You cannot use view properties with a vector index"

> CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC);

InvalidRequest: Error from server: code=2200 [Invalid query]
  message="Indexes do not allow for specifying the clustering order"
```

and so on. For more examples, see the relevant tests.

References:
[1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views
[2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index
[3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options
[4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clause

Fixes scylladb/scylladb#16454

Backport: not needed. This is an enhancement.

Closes scylladb/scylladb#24977

* github.com:scylladb/scylladb:
  cql3: Extend DESC INDEX by view properties
  cql3: Forbid using CLUSTERING ORDER BY when creating index
  cql3: Extend CREATE INDEX by MV properties
  cql3/statements/create_index_statement: Allow for view options
  cql3/statements/create_index_statement: Rename member
  cql3/statements/index_prop_defs: Re-introduce index_prop_defs
  cql3/statements/property_definitions: Add extract_property()
  cql3/statements/index_prop_defs.cc: Add namespace
  cql3/statements/index_prop_defs.hh: Rename type
  cql3/statements/view_prop_defs.cc: Move validation logic into file
  cql3/statements: Introduce view_prop_defs.{hh,cc}
  cql3/statements/create_view_statement.cc: Move validation of ID
  schema/schema.hh: Do not include index_prop_defs.hh
2026-01-14 09:54:27 +02:00
Pavel Emelyanov
e57ee84662 util: Re-use seastar::util::memory_data_sink
A data_sink that stores buffers into an in-memory collection had
appeared in seastar recently. In Scylla there's similar thing that uses
memory_data_sink_buffer as a container, so it's possible to drop the
data_sink_impl iself in favor of seastar implementation.

For that to work there should be append_buffers() overload for the
aforementioned container. For its nice implementation the container, in
turn, needs to get push_back() method and value_type trait. The method
already exists, but is called put(), so just rename it. There's one more
user of it this method in S3 client, and it can enjoy the added
append_buffers() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28124
2026-01-14 08:54:00 +02:00
Avi Kivity
3fc5a32136 tools: toolchain: update instructions for building optimized clang with version information
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.

Closes scylladb/scylladb#28135
2026-01-14 06:46:20 +02:00
Botond Dénes
6e17bf5c1a tools/scylla-nodetool: migrate to std::localtime
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).

Closes scylladb/scylladb#27821
2026-01-13 20:46:31 +02:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
1e37781d86 schema: Add initializer for compression defaults
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).

Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.

Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).

Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.

Fixes #26914.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
5b4aa4b6a6 schema: Initialize static properties eagerly
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).

Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".

This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.

In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:55 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Nikos Dragazis
4ec7a064a9 test: Check that CQL and Alternator tables respect compression config
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.

Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.

Mark them as xfail-ing. The marker will be removed later in this series.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:16:08 +02:00
Avi Kivity
489d1a0fbc Merge 'replica: don't throw exceptions for read timeout' from Botond Dénes
Read timeouts are a common occurence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to thow exceptions while propagating them up the future chain. Add a test to enfore and detect regressions.

Fixes: scylladb/scylladb#25062

Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this.

Closes scylladb/scylladb#25068

* github.com:scylladb/scylladb:
  reader_permit: remove check_abort()
  test/boost/database_test: add test for read timeout exceptions
  sstables/mx/reader: don't throw exceptions on the read-path
  readers/multishard: don't throw exceptions on the read-path
  replica/table: don't throw exceptions on the read-path
  multishard_mutation_query: fix indentation
  multishard_mutation_query: don't throw exceptions on the read-path
  service/storage_proxy: don't throw exceptions on the full-scan path
  cql3/query_processor: don't throw exceptions on the read-path
  reader_permit: add get_abort_exception()
2026-01-13 16:17:41 +02:00
Calle Wilund
da17e8b18b gossiper/main: Extend special treatment of node ID resolve for rpc_address
Refs #27429

If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.

Added export of this via gossiper for symmetry.
2026-01-13 14:12:19 +01:00
Avi Kivity
c6dfae5661 treewide: #include Seastar headers with angle brackets
Seastar is an external library from the point of view of
ScyllaDB, so should be included with angle brackets.

Closes scylladb/scylladb#27947
2026-01-13 14:56:15 +02:00
Tomasz Grabiec
63b9a7e2b5 test: pylib: log_browsing: Grep logs without considering newly appended lines
At the end of the test case, the framework greps logs for errors and
backtraces. The servers are still running at this point. Some test
cases enable debug-level logging. If servers manage to produce new
lines between the python script processes them, the grep will never
return.

Protect against this by grepping over a file snapshot.

Fixes #28086

Closes scylladb/scylladb#28088
2026-01-13 14:41:02 +02:00
Nadav Har'El
fc6fff61d1 docs/alternator: add document on reducing Alternator network costs
This patch adds a new document, docs/alternator/network.md,
explaining the various mechanisms that can be used to reduce
network usage in Alternator. It explains compression of requests
and responses, header reduction, rack-aware routing, and RPC compression.

Many of these topics - especially support in the client libraries -
are work in progress, so some details are still missing in the new
document. Still, I think it is a good start that can be improved
later.

Fixes #27915.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27927
2026-01-13 14:29:01 +02:00
Radosław Cybulski
7b1060fad3 alternator: refactor streams.cc
Fix indentation levels from previous commit.
2026-01-13 12:04:13 +01:00
Radosław Cybulski
ef63fe400a alternator: refactor streams.cc
Refactor streams.cc - turn `.then` calls into coroutines.
Reduces amount of clutter, lambdas and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
2026-01-13 12:03:54 +01:00
Karol Baryła
2c471ec57a protocol-extensions.md: Fix client_options docs
When this column and relevant SUPPORTED key were added, the
documentation was mistakenly put in the section about shard awareness
extension. This commit moves the documentation into a dedicated section.

I also expended it to describe both the new column and the new SUPPORTED
key.
2026-01-13 11:49:00 +01:00
Karol Baryła
30d4d3248d system_keyspace.md: Add client_options column
It was recently introduced, but the documentation was not updated.
2026-01-13 11:35:52 +01:00
Karol Baryła
a0a6140436 system_keyspace.md: Fix order in system.clients
scheduling_group column is places after protocol_version in the current
version.
2026-01-13 11:33:34 +01:00
Pavel Emelyanov
9ffd22491f test: Validate S3 endpoints new format works
Extend the test_get_object_store_endpoints() test to configure S3
endpoints in full-url format and check that they are rendered properly
via API/CQL.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:18 +03:00
Pavel Emelyanov
bd225784bd docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
f227de24b2 object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
8f97e6b3de s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
bee3564564 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
83e88d206c test: Rename badconf variable into objconf
It's not actually a "bad" config, it's just some config the test works
with.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:20 +03:00
Pavel Emelyanov
9c627bc44a test: Split the object_store/test_get_object_store_endpoints test
It tests two things -- the way object storage config is represented via
API and CQL (from sytem.config) and that updating config affects CREATE
KEYSPACE CQL (with keyspace storage options)

It's better to split the test, as its former part is going to be
extented to validate old/new config formats (see #26570)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:23:03 +03:00
Robert Bindar
dfcabb5fa4 test: reduce dataset and number of test cases or debug builds
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:51 +02:00
Robert Bindar
ca3c57e821 test: bump repair timeout up, it's sometimes not enough in CI
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:49 +02:00
Robert Bindar
6f5e58e718 test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:48 +02:00
Robert Bindar
6e636a4231 test: extend test_restore_with_streaming_scopes
to test restoring with a different min_tablet_count
than the schema was originally created with.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:46 +02:00
Robert Bindar
92cd1ddec3 test: Adjust test_restore_primary_replica_different_dc_scope_all
to match the new topology arhitecture

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:44 +02:00
Robert Bindar
db13ece9a0 test: Refactor restoring code in test_backup to match SM pattern
This patch refactors the restoring code in cluster/test_backup.py
so it matches better the way SM works.
The patch also refactors test_restore_with_streaming_scopes so to
facilitate running restore scenarios under all supported scopes
with or w/o primary_replica_only enabled by reusing the servers
and backups for a topology. This allows us to test a lot more scenarios
without making the test impossibly slow.

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:43 +02:00
Robert Bindar
ba01589f53 test: add check_mutation_replicas calls after fresh creation of dataset
to validate that mutation assertions are sane

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:41 +02:00
Robert Bindar
b835d32cb0 test: extend create_dataset to accept consistency_level
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:35 +02:00
Robert Bindar
e7d44356d9 test: refactor check_mutation_replicas so it's more readable
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:31 +02:00
Robert Bindar
733b4dbbb7 test: make create_dataset async and refactor so it's configurable
with num_keys and min_tablet_count

split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:46:20 +02:00
Robert Bindar
f2c8949e4a test: use defaultdict in collect_mutations
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:45:03 +02:00
Robert Bindar
45faeba97d test: add log marks to facilitate reusing server for restore
split from bhalevy/load-balance-primary-replica

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:48 +02:00
Robert Bindar
d88036db48 locator: tablets: Distribute data evenly among primary replicas during restore
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.

split from bhalevy/load-balance-primary-replica

Fixes #27281

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:20 +02:00
Nadav Har'El
2a831ad373 Merge 'Address CodeQL Errors' from Botond Dénes
Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality.
This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed.
Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code.

Code cleanup, no backport

Closes scylladb/scylladb#27838

* github.com:scylladb/scylladb:
  pgo/pgo.py: don't mutate input params
  test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
  test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
  idl-compiler.py: raise TypeError instead of raw str
  test/pylib/lcov_utils.py: don't call set when iterating over it
  configure.py: move away from .format(**locals())
  test/cluster/object_store/conftest.py: add missing call to parent constructor
  idl-compiler.py: add missing call to parent class constructor
  tools/scyllatop/fake.py: pass correct number of args to _add_metric
2026-01-13 11:43:57 +02:00
Nadav Har'El
609b283d98 test/cqlpy: add another reproducer for known issue
This patch adds a second reproducer for issue #25839, which is about
scanning a secondary index which returns partial results. The new test
uses count(*) without requesting the row themselves, but still has the
same problem of counting only part of the rows. This is the problem that
a user reported in issue #28026.

Unlike the previous test, this test works correctly on older versions
of Scylla - by using larger data, like on Cassandra - without changing
a configuration variable that did not yet exist. So with this test we
can confirm that this bug is a Scylla 5.2 regression:

  test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count

passes, while

  test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count

fails. It also fails on master, so the new test is marked "xfail".

Refs #25839
Refs #28026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28108
2026-01-13 11:15:27 +02:00
Botond Dénes
7b562bb185 Merge 'system.clients: Address SSL refactor review comments' from Piotr Smaron and Copilot
Addresses outstanding review comments from PR #22961 where SSL field
collection was refactored into generic_server::connection base class.
This patch consists of minor cosmetic enhancements for increased
readability, mainly, with some minor fixups explained in specific
commits.

Cosmetic changes, no need to backport.

Closes scylladb/scylladb#27575

* github.com:scylladb/scylladb:
  test_ssl: fix indentation
  generic_server: improve logging broken TLS connection
  test_ssl: improve timeout and readability
  alternator/server: update SSL comment
2026-01-13 11:00:26 +02:00
Botond Dénes
354c805e6a reader_permit: remove check_abort()
This method can cause performance regressions if used in the wrong place
-- namely if it is used to abort reads by throwing the abort exception.
Exceptions should be propagated during reads without throwing them,
otherwise they cause extra CPU load, making a bad situation worse.
Remove this method, so it doesn't accidentally get more users, migrate
remaining users to get_abort_exception().
2026-01-13 10:47:57 +02:00
Botond Dénes
a0ddac655d test/boost/database_test: add test for read timeout exceptions
Read timeouts shouldn't trigger exceptions thrown, exceptions should be
solely propagated via futures, otherwise they put extra strain on the
system at the worst possible time: when it is overload already enough
that reads started to time out.
The test covers both single partition reads and full scans, with two
scenarios:
* timeout while the read is queued
* timeout when the read is already ongoing
2026-01-13 10:47:57 +02:00
Botond Dénes
6801611b94 sstables/mx/reader: don't throw exceptions on the read-path
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
2026-01-13 10:47:57 +02:00
Botond Dénes
fa6ffe9d20 readers/multishard: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
1f6f7ceb68 replica/table: don't throw exceptions on the read-path
Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.

Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
2026-01-13 10:47:57 +02:00
Botond Dénes
131489fe48 multishard_mutation_query: fix indentation
Left broken by previous patch.
2026-01-13 10:47:57 +02:00
Botond Dénes
7eeb7fcfba multishard_mutation_query: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
9bea842c01 service/storage_proxy: don't throw exceptions on the full-scan path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
404f5b0808 cql3/query_processor: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Botond Dénes
5cd237fec5 reader_permit: add get_abort_exception()
Will replace check_abort(). The latter throws an exception which is
something we want to avoid when a read is aborted, in particular when it
times out.
Also add a convenience get_abort_exception() method to mutation_reader.
2026-01-13 10:47:57 +02:00
Michał Hudobski
c8aa49b196 vector search, paging: add test for paging warnings
We add a test that validates that indexed queries
do not throw a warning related to vector search paging

Fixes: SCYLLADB-248

Closes scylladb/scylladb#28077
2026-01-13 10:33:36 +02:00
Nadav Har'El
34191d8fd4 alternator: fix signature checking of headers with multiple spaces
We have a test in test_compressed_response.py that reproduces a bug
where in Alternator's signature checking code, if a header had multiple
consecutive spaces its signature isn't checked correctly.

This patch fixes this and that xfailing test begins to pass.

But it turns out that the handling of multiple consecutive spaces in
headers when calculating the authentication signature is just one example
of "header canonization" that the AWS Signature V4 specification requires
us to do. There are additional types of header canonization that Alternator
must do, and this patch also adds new tests in test_authorization.py for
checking *all* the types of canonization.

Fortunately, for all other types of canonizations, we already handled
them correctly - Alternator already lowercases header names, sorts them
alphabetically and removes leading and trailing spaces before calculating
the signature. So most of the new tests added pass also without this patch,
and only one of them, test_canonization_middle_whitespace, needs this
patch to pass. As usual, all the new tests also pass on DynamoDB.

Fixes #27775

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28102
2026-01-13 10:29:13 +02:00
Andrei Chekun
3f14d45d6e test.py: fix the link for the failed_test directory
With new UI Jenkins escaping the HTML tags during rendering to prevent
XSS. This will show just link without custom name as a string that can
be copied and then pasted to navigate to the failed directory.

Closes scylladb/scylladb#28062
2026-01-13 10:22:38 +02:00
Yaniv Kaul
4e3aa53f8b test/rest_api/test_gossiper.py: fix for Variable defined multiple times
To fix the problem, we need to remove the first, redundant definition of
test_gossiper_unreachable_endpoints (lines 19-24). The second definition
(lines 25-40) should be retained as it has more substantial test logic.
No other code changes or imports are needed, as the test logic is
preserved fully in the retained definition.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27632
2026-01-13 10:17:29 +02:00
Yaniv Kaul
1722e35a7e .github/workflows/docs-validate-metrics.yml: add workflow permissions
Potential fix for code scanning alert no. 167: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27819
2026-01-13 10:16:35 +02:00
Yaniv Kaul
39c26794ec .github/workflows/docs-pr.yaml: add workflow permissions
Potential fix for code scanning alert no. 145: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27808
2026-01-13 10:15:45 +02:00
Yaniv Michael Kaul
d9cce7ccbf .github/copilot-instructions.md: add test philosophy
Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo best practices.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28032
2026-01-13 10:05:13 +02:00
Anna Stuchlik
8141283262 doc: add the version name to the Install ScyllaDB page
Fixes https://github.com/scylladb/scylladb/issues/28021

Closes scylladb/scylladb#28022
2026-01-13 10:01:48 +02:00
Avi Kivity
66aee0fb5e alternator: add optional listeners for proxy protocol v2
Following 954f2cbd2f, which added proxy protocol v2 listeners
for CQL, we do the same for alternator. We add two optional ports
for plain and TLS-wrapped HTTP.

We test each new port, that the old ports still work, and that
mixing up a port with no proxy protocol and a connection with proxy
protocol (or the opposite) fails. The latter serves to show
that the testing strategy is valid and doesn't just pass whatever
happens. We also verify that the correct addresses (and TLS mode)
show up in system.clients.

Closes scylladb/scylladb#27889
2026-01-13 09:59:24 +02:00
Nadav Har'El
e7df03127b alternator: support "deflate" encoding in request compression
Currently Alternator supports compressed requests in the gzip format
with "Content-Encoding: gzip". We did not support any other compression
formats.

It turns out that DynamoDB also supports the "deflate" encoding.
The "deflate" format is just a small variant of gzip and also supported
by the same zlib library that we already use, so it is very easy
to add support for it as well. So this patch adds it.

Beyond compatibility with DynamoDB, another benefit of this patch is
symmetry with our response compression support (PR #27454), where
we supported both gzip and deflate compression of responses - so
we should support the same for requests.

This patch also adds tests for Content-Encoding: deflate, which pass
on DynamoDB (proving that "deflate" is indeed supported there).
On Alternator the new tests failed before this patch and pass with
this patch.

Refs #27243 (which asks to support more compression formats).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27917
2026-01-13 09:58:12 +02:00
Nadav Har'El
a1f198d453 test/cqlpy: translate Cassandra's test InsertInvalidateSizedRecordsTest
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertInvalidateSizedRecordsTest.java into our
cqlpy framework.

This is one of the tests added to Cassandra as part of the vector
search work, but actually has nothing to do with vector search -
it checks what happens when key columns of different types exceeed
their maximum size (64KB).

Unfortunately, each one of the tests added here *fail* on ScyllaDB,
providing more reproducers for two already known issues (which
already had plenty of reproducers...):

Refs  #8627 Cleanly reject updates with indexed values where value > 64k
Refs #12247 Better error reporting for oversized keys during INSERT

One of the tests also fails on Cassandra, due to CASSANDRA-19270.
It is not clear to me how this unit test actually passed on Cassandra,
I can only guess that the Python driver somehow makes the request
differently than what the Java unit tests use to make requests to
Cassandra.

One of the tests in the original Cassandra source file I did not
translate, readingEmptyStringsForDifferentTypes, because it tests
cqlsh, not pure CQL.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27944
2026-01-13 08:59:36 +02:00
Yaniv Kaul
2c26ca3651 .github/workflows/read-toolchain.yaml: hardening for code scanning alert no. 146: Workflow does not contain permissions #27807
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27807
2026-01-13 08:57:53 +02:00
tomek7667
19313d67e3 docs/cql/ddl.rst: fix formatting of deprecated initial sub-option
Closes scylladb/scylladb#26852
2026-01-13 08:55:24 +02:00
Anna Stuchlik
14cadcbc18 doc: remove references to Open Source
Fixes https://github.com/scylladb/scylladb/issues/28118

Closes scylladb/scylladb#28119
2026-01-13 08:43:26 +02:00
Botond Dénes
25db8f6a70 pgo/pgo.py: don't mutate input params
It is considered a dangerous practice with possible unintended
side-effects, affecting later calls to the same function.
Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
4ec3d76b87 test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param
It is considered a dangerous practice as it creates a side-effect for
later calls to the same function.
Create a new variable instead and mutate that. Also remove the unused
update_known_ids parameter, which defaults to True and no caller changes
it. Passing False to this param also seem to have no effect. Instead of
trying to guess what the desired effect of passing False is and fixing
it, just remove this unused param.

Found by CodeQL "Modification of parameter with default".
2026-01-13 08:33:17 +02:00
Botond Dénes
157fe2b80e test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict()
To hopefully shut up CodeQL "Iterable can be either a string or a sequence".
This change makes the code more readable anyway, so it is more than just
a gratuitous change to make some code-scanner happy.
2026-01-13 08:33:17 +02:00
Botond Dénes
849332c5c2 idl-compiler.py: raise TypeError instead of raw str
Unlike in C++, in Python one can only throw objects which inherit from
Exception. The message complains about wrong type so wrap it in
TypeError before passing to raise.
Found by CodeQL "Illegal raise".
2026-01-13 08:33:17 +02:00
Botond Dénes
3c6e9637a0 test/pylib/lcov_utils.py: don't call set when iterating over it
Probably a typo. Found by CodeQL "Non-callable called".
2026-01-13 08:33:17 +02:00
Botond Dénes
eb4ee5a126 configure.py: move away from .format(**locals())
Use f strings instead, they are just as convenient with the added bonus
of editors providing syntax highighting for it.

Additionally, this shuts up CodeQL complaint about "Suspicious unused
loop iteration variable" in loops where the loop variable was passed to
format indirectly via **locals().
2026-01-13 08:33:17 +02:00
Botond Dénes
d2db84714e test/cluster/object_store/conftest.py: add missing call to parent constructor
Replace manual init of parent fields.
Found by CodeQL: "Missing call to superclass `__init__` during object
initialization".

The secret_key is not initialized to server.secret_key, instead of
server.access_key. This probably fixes a (benign) bug.
2026-01-13 08:33:17 +02:00
Botond Dénes
26e978cf08 idl-compiler.py: add missing call to parent class constructor
Replace manual init of parent fields.
Found by CodeQL "Missing call to superclass `__init__` during object
initialization".
2026-01-13 08:33:17 +02:00
Botond Dénes
20d0782dc4 tools/scyllatop/fake.py: pass correct number of args to _add_metric
Discovered by CodeQL "Wrong number of arguments in call".
2026-01-13 08:33:17 +02:00
Michał Jadwiszczak
649efd198f docs/dev/service_levels: update docs to service levels on raft
Since Scylla 6.0, service levels are manged by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.

Fixes scylladb/scylladb#18177

Closes scylladb/scylladb#26556
2026-01-13 06:49:18 +02:00
Botond Dénes
725e99e263 Merge 'test.py: Fix boost logs' from Andrei Chekun
Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail.

Fixes: https://github.com/scylladb/scylladb/issues/28045

Framework enhancements, no need to backport.

Closes scylladb/scylladb#28107

* github.com:scylladb/scylladb:
  test.py: remove XML log from fail summary
  test.py: fix truncated boost output to stdout file
2026-01-13 06:19:05 +02:00
Anna Stuchlik
791ab4ed02 doc: clarify the information about SSTable version support
Fixes https://github.com/scylladb/scylladb/issues/27765

Closes scylladb/scylladb#27835
2026-01-13 06:17:37 +02:00
Tomasz Grabiec
8dc20e6aaf test: cluster: Add reproducer for missed notification in topology coordinator 2026-01-13 00:40:23 +01:00
Tomasz Grabiec
7a04dd2d22 topology_coordinator: Wake up the state machine after stats refresh
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.

The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.

Fixes #25163
2026-01-13 00:40:23 +01:00
Tomasz Grabiec
d910e6ea63 topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.

This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.

Refs #28086
2026-01-13 00:40:16 +01:00
Tomasz Grabiec
e5dee2aab8 topology_coordinator: Fix potential missed notification
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.

Fix by subscribing for the event before checking for work.

Fixes #27958
2026-01-13 00:39:01 +01:00
Tomasz Grabiec
2b7aa3211d topology_coordinator: Refresh load stats after table is created or altered
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes #27921
2026-01-13 00:38:59 +01:00
Tomasz Grabiec
663831ebd7 tablets: Do a group0 read barrier on tablet load stats refresh
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c4c5ed5aba topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
Refresh can be triggered from different places, but it should run in
the gossip scheduling group, like group0 operations.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
5e6935f276 test: Use ManagerClient.{disable,enable}_tablet_balancing() 2026-01-13 00:38:00 +01:00
Tomasz Grabiec
6936704677 test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c8098e07c9 test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
It's a global operation, so we can use any server.

It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
2026-01-13 00:38:00 +01:00
Andrei Chekun
dfa6a61721 test.py: remove XML log from fail summary
Remove XML log from fail summary.
Add text from the first error in the XML file to the fail summary
2026-01-12 14:26:58 +01:00
Andrei Chekun
d96a50481a test.py: fix truncated boost output to stdout file
Change the behavior of the catching the boost log output. With this
change boost will output it's logging to stdour with HRF format and to
the tempfile in XML format. This will help for easier debuggint when all
messages will be in the output file and still in the fail summary.
2026-01-12 14:15:23 +01:00
Botond Dénes
6bcc18e5c6 erge 'test.py: integrate python tests to be executed with pytest runner' from Andrei Chekun
This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest.
This is another step for unification of test execution.
Convert skip_mode function to `pytest.mark` to be able to use to annotate the whole module instead of each test explicitly.

NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api

Changes only in framework, so no backport.

This PR will increase the amount of the tests by 30 test, due to the fact that how test.py and pytest discover tests. test.py count a file as a test, and when skip used in suite.yaml it will exclude the tests from discovery completely.
While the pytest count test funstion as a test and uses skip_mode mark and will discover the tests, but it will skip them during execution, hence the difference
test.py output before PR:
```bash
> ./test.py --mode=release  rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics

```
test.py output in this PR:
```bash
> ./test.py --mode=release  test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list

rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1
rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1
rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1
rest_api/test_compaction_task.py::test_regular_compaction_task.release.1
rest_api/test_compaction_task.py::test_compaction_task_abort.release.1
rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1
rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1
rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1
rest_api/test_task_manager.py::test_task_manager_status_running.release.1
rest_api/test_task_manager.py::test_task_manager_status_done.release.1
rest_api/test_task_manager.py::test_task_manager_status_failed.release.1
rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1
rest_api/test_task_manager.py::test_task_manager_wait.release.1
rest_api/test_task_manager.py::test_task_manager_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1
rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1
rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1
rest_api/test_task_manager.py::test_module_not_exists.release.1
rest_api/test_task_manager.py::test_task_folding.release.1
rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1
```

Fixes: https://github.com/scylladb/scylladb/issues/27716

Closes scylladb/scylladb#26395

* github.com:scylladb/scylladb:
  test.py: fix test_vector_similarity.py
  docs: add directories excluded from test.py
  test.py: prevent file descriptors leaking
  test.py: capture print inside the test
  test.py: do not print header for collection with test.py
  test.py: remove not supported functionality
  test.py: switch of execution of several test directories by test.py runner
  test.py: integrate python tests to be executed with pytest runner
  test.py: fix test/vector_search_validator to be able to run with pytest
  test.py: prepare base class for migration
  test.py: move environment preparation to one method
  test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
2026-01-12 14:17:19 +02:00
Botond Dénes
04b8f72946 Merge 'repair: Implement auto repair for tablet repair' from Asias He
repair: Implement auto repair for tablet repair

This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99

New feature. No backport.

Closes scylladb/scylladb#27534

* github.com:scylladb/scylladb:
  topology_coordinator: Add metrics for tablet repair
  repair: Implement auto repair for tablet repair
2026-01-12 14:16:01 +02:00
Yaniv Kaul
f1c9eda49e Potential fix for code scanning alert no. 144: Workflow does not contain permissions
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27809
2026-01-12 12:21:35 +02:00
Marcin Maliszkiewicz
3c9f52e709 Merge 'doc: update the Web Installer instructions' from Anna Stuchlik
This PR:
- Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly.
- Removes the instruction for Open Source from the Web Installer docs.

Fixes https://github.com/scylladb/scylladb/issues/28005
Fixes https://github.com/scylladb/scylladb/issues/28079

Closes scylladb/scylladb#28046

* github.com:scylladb/scylladb:
  doc: remove the instruction for Open Source from the Web Installer docs
  doc: add the version variable to the Web Installer instructions
2026-01-12 11:10:04 +01:00
Petr Gusev
889d7782ed treewide: use coroutine::maybe_yield in coroutines
It's more efficient since coroutine::maybe_yield returns
a lightweight struct (awaitable), not the future.

Closes scylladb/scylladb#28101
2026-01-12 10:38:47 +01:00
Marcin Maliszkiewicz
09af3828ab auth: remove confusing deprecation msg from hash_with_salt
Closes scylladb/scylladb#27705
2026-01-12 10:12:54 +01:00
Asias He
7980890029 topology_coordinator: Add metrics for tablet repair
- scylla_tablet_ops_failed

Number of failed tablet {auto, user} repair

- scylla_tablet_ops_succeeded

Number of succeeded tablet {auto, user} repair

Currently auto_repair and user_repair tablet task are added. We can add
more tablet tasks later, e.g., rebuild, migration.
2026-01-12 15:26:05 +08:00
Alex
e430065c92 db: views: serialize create/drop view operations via shard 0
Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build`

A problematic sequence looks like this:

* `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created
* `on_drop_view()` runs on shard 0 → entry for shard 0 is removed
* `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again
* `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains

This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI.

This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating.

new process:
  - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background.
  - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That:
      - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...).
  - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...).
  - Each shard runs handle_create_view_local(...), which:
      - waits for pending base writes/streams, flushes the base,
      - resets the reader to the current token and adds the new view,
      - handles errors and triggers _build_step to continue processing.

  Drop view

  - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) in the background.
  - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed.
  - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...).
  - Each shard runs handle_drop_view_local(...), which:
      - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps,
      - ignores missing keyspace cases.
  - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which:
      - removes global build progress, built‑view state, and view build status in system tables,

  Shutdown

  - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted.

In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness.

Fixes: https://github.com/scylladb/scylladb/issues/27898
Backport: not required as the problem happens on master

Closes scylladb/scylladb#27929
2026-01-12 09:23:22 +02:00
Michał Hudobski
92c988514c vector_search: allow all where clauses in vector search queries
To prepare for implementation of filtering we skip validation
of where clauses in vector search queries. All queries that would
be blocked by the lack of ALLOW FILTERING now will pass through.

Fixes: VECTOR-410

Closes scylladb/scylladb#27758
2026-01-11 12:56:44 +02:00
Marcin Maliszkiewicz
03e0dd0841 Merge 'test/alternator: fix most tests to run on DynamoDB' from Nadav Har'El
We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec.

Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them.

Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer)

Closes scylladb/scylladb#27995

* github.com:scylladb/scylladb:
  test/alternator: fix some expected error messages to fit DynamoDB
  test/alternator: fix compressed request test on non-us-east1
  test/alternator: fix test's expected error message on DynamoDB
  test/alternator: mark Alternator-only test scylla_only
  test/alternator: fix test on DynamoDB
  test/alternator: increase wait_for_gsi() timeout
  test/alternator: fix test passing a spurious parameter
2026-01-09 18:05:20 +01:00
Botond Dénes
7e1c8776b7 docs: remove sstabledump and sstablemetadata
These tools are deprecated and no longer shipped by ScyllaDB packages.
They no longer support the latest SSTable versions and ScyllaDB-only
features, like encryption and dictionary based compression.

Remove them from the documentation.

Closes scylladb/scylladb#27608
2026-01-09 17:31:54 +01:00
Dawid Mędrek
2385afa1c7 scripts/pull_github_pr.sh: Update instructions for creating token
The interface of Jenkins has changed, and the instructions for creating
a token are out-of-date. This commit updates them.

Closes scylladb/scylladb#28054
2026-01-09 17:45:00 +02:00
Ferenc Szili
0ede8d154b docs: add docs for size based load balancing
This patch updates the documentation for size based load balancing.

Closes scylladb/scylladb#27616
2026-01-09 16:25:25 +02:00
Andrei Chekun
1f60208aa0 test.py: fix test_vector_similarity.py
There is a known limitation of the xdist.
Since it makes discovery in each thread, then compare it with master thread. The discovered lists of test should be the same. Sets are not order guaranteed, so they should not be used for parametrized testing, because discovery of the tests with using xdist will fail.
This PR just converts set to dist, to eliminate issue mentioned above.
2026-01-09 15:08:40 +01:00
Yaniv Michael Kaul
af8eaa9ea5 scripts: fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.

I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.

While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).

Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27883
2026-01-09 15:13:12 +02:00
Anna Stuchlik
396093ff60 doc: remove the instruction for Open Source from the Web Installer docs
Fixes https://github.com/scylladb/scylladb/issues/28079
2026-01-09 14:07:32 +01:00
Botond Dénes
af6cb0d0a4 Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

Fixes #28057

Backport this PR to all branches as it fixes a problematic bug.

Closes scylladb/scylladb#27435

* github.com:scylladb/scylladb:
  gossiper: add_saved_endpoint: make generations of excluded nodes negative
  test: introduce test_full_shutdown_during_replace
  utils: error_injection: allow aborting wait_for_message
  raft topology: preserve IP -> ID mapping of a replacing node on restart
2026-01-09 14:56:16 +02:00
Calle Wilund
a7cdb602e1 db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998
2026-01-09 14:06:58 +02:00
Łukasz Paszkowski
7bf26ece4d test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads
The current code:
```
try:
   cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
   pass
```

contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.

Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.

Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```

Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.

This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
  read on the coordinator node
- Reconnect the driver after node restart

Fixes https://github.com/scylladb/scylladb/issues/27933

Closes scylladb/scylladb#27934
2026-01-09 13:42:05 +02:00
Andrei Chekun
82e81a8664 docs: add directories excluded from test.py
Add new directories that are excluded from the test.py executor and will
be fully managed by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
353bae7d66 test.py: prevent file descriptors leaking
With migration to the pytest, file descriptors will be hanged during the whole life of the process. Previously it was not an issue, because test.py was executing only one file with Popen, so descriptors will be freed with process done. With new approach they are blocked. This will allow to eliminate this.
Fix issue when we had issue with getting cluster and then trying to set it dirty while it None.
Put cluster to the pool only if it was created
2026-01-09 11:59:25 +01:00
Andrei Chekun
67c5267053 test.py: capture print inside the test
Capture the printing inside the test case to output it after the test and not directly during the testing process.
2026-01-09 11:59:25 +01:00
Andrei Chekun
594aedd6a5 test.py: do not print header for collection with test.py
Skip printing the default pytest headers when printing list of the tests.
Before:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

Test session starts (platform: linux, Python 3.13.9, pytest 8.3.4, pytest-sugar 1.1.1)
rootdir: /home/xtrey/projects/scylladb/test
configfile: pytest.ini
plugins: xdist-3.8.0, allure-pytest-2.15.0, sugar-1.1.1, anyio-4.8.0, asyncio-0.24.0, timeout-2.3.1
asyncio: mode=Mode.AUTO, default_loop_scope=session
timeout: 24000.0s
timeout method: signal
timeout func_only: False
session timeout: 24000.0s
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
After:
```
$ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list

boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1

Results (0.06s):
```
2026-01-09 11:59:25 +01:00
Andrei Chekun
21a1ff3d5c test.py: remove not supported functionality
In the current state pytest do not support the order of execution, so this parameter is removed. There is no big need in this due to the differences what pytest and test.py counted test. pytest run test functions in the threads, while test.py executed test files in the threads. That's why pytest's way is more granular and allows to fill threads better.
Remove skip node, since it already added as a pytest mark for each test in the file.
Remove pool_size, since this is not used by pytest at all. Pytest uses
xdist to set the amount of threads instead of pool_size used by test.py
2026-01-09 11:59:25 +01:00
Andrei Chekun
e8c50a5ad4 test.py: switch of execution of several test directories by test.py runner
With this commit test.py will lose ability to run tests by itself always bypassing execution to the pytest.

NOTE: this is a breaking change. From this commit, several directories
with tests will require a path to the file to launch the test.
Affected directories
test/alternator
test/broadcast_tables
test/cql
test/cqlpy
test/rest_api
2026-01-09 11:59:25 +01:00
Andrei Chekun
61d49525ad test.py: integrate python tests to be executed with pytest runner
With this commit test.py will be bypassing the tests execution to the pytest. However, it will still be able to run test by itself.
With providing test name like `broadcast_tables/test_broadcast_tables` it will execute test with test.py runner, but if the path to the file will be provided like `test/broadcast_tables/test_broadcast_tables.py` it will bypass execution to the pytest.
`--test-py-init` tells to run pytest session in test.py-compatible mode
Update the help text for the name parameter for test.py about changes
how it works and which directory is served by pytest
2026-01-09 11:59:25 +01:00
Andrei Chekun
808b29885f test.py: fix test/vector_search_validator to be able to run with pytest
build_mode fixture have dynamic scope. It depends how the pytest is
executed. When it executed through test.py scope will be session and
since it's broader that package everything work fine. While with pure
pytest it will fail because build_mode will have module scope.
This fix allows to run tests with pure pytest, this needed for migration
test to be executed by pytest runner instead test.py.
2026-01-09 11:59:25 +01:00
Andrei Chekun
8252de7b55 test.py: prepare base class for migration
Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml
After full migration suite.yaml case will be removed
2026-01-09 11:59:25 +01:00
Andrei Chekun
48ff74b6b2 test.py: move environment preparation to one method
Since anyway these two methods should be called one by one in two different cases: when test.py executes test and pytest executes test, merging them into one. Additionally, set environment variable to show the underneath pytest process that environment was already prepared and there is no need to clean directories or start additional services.
2026-01-09 11:59:25 +01:00
Andrei Chekun
e074e21490 test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT
Introduce the new environment variable that will be used to signalize to the pytest runner that environment war already prepared by test.py. This needed to be able to run the test with pytest and test.py(that actually will run pytest underneath).
2026-01-09 11:59:25 +01:00
copilot-swe-agent[bot]
8c48b82b84 test_ssl: fix indentation 2026-01-09 10:27:17 +01:00
Piotr Smaron
2bcbebe92d generic_server: improve logging broken TLS connection
Preiously we were logging a broken TLS connection and then this has been
logged later again, so now instead of logging we're constructing an
exception with a message extened with TLS info, which later will be
catched with its full message still logged.
2026-01-09 10:24:55 +01:00
Piotr Smaron
7016fc4835 test_ssl: improve timeout and readability
1. With this change the test really waits 10s, previously (in case
   something went wrong), the timeout could take way more than that.
2. Added `else` to above `if` to increase clarity of execution flow -
   it doesn't change logic, but makes it more clear.
2026-01-09 10:22:19 +01:00
Asias He
7ba7b25bdd repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99
2026-01-09 16:11:39 +08:00
Botond Dénes
60570d7114 Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack

Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.

The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.

Fixes scylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820

backport to relevant versions for materialized views with tablets since it depends on rf-rack validity

Closes scylladb/scylladb#26354

* github.com:scylladb/scylladb:
  docs: update RF-rack restrictions
  cql3: don't apply RF-rack restrictions on vector indexes
  cql3: add warning when creating mv/index with tablets about rf-rack
  service/tablet_allocator: always allow tablet merge of tables with views
  locator: extend rf-rack validation for rack lists
  test: test rf-rack validity when creating keyspace during node ops
  locator: fix rf-rack validation during node join/remove
  test: test topology restrictions for views with tablets
  test: add test_topology_ops_with_rf_rack_valid
  topology coordinator: restrict node join/remove to preserve RF-rack validity
  topology coordinator: add validation to node remove
  locator: extend rf-rack validation functions
  view: change validate_view_keyspace to allow MVs if RF=Racks
  db: enforce rf-rack-validity for keyspaces with views
  replica/db: add enforce_rf_rack_validity_for_keyspace helper
  db: remove enforce parameter from check_rf_rack_validity
  test: adjust test to not break rf-rack validity
2026-01-09 10:01:23 +02:00
Patryk Jędrzejczak
eee2b6c7af Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec
Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.

We should do it via topology request, which will interrupt the tablet scheduler.

Enabling of balancing can be immediate.

Fixes https://github.com/scylladb/scylladb/issues/27647
Fixes #27210

Closes scylladb/scylladb#27736

* https://github.com/scylladb/scylladb:
  test: Verify that repair doesn't block disabling of tablet load balancing
  tablets: Make balancing disabling call preempt tablet transitions
2026-01-08 21:55:19 +02:00
Piotr Dulikowski
8e3e39a64a Merge 'service/storage_service: update service levels cache after upgrade to v2' from Michał Jadwiszczak
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this patch adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90)

This fix should be backported to all versions containing service levels on Raft.

[SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27585

* github.com:scylladb/scylladb:
  service/storage_service: update service levels cache after upgrade to v2
  service/storage_service: check if service levels were already upgraded before doing migration to raft
2026-01-08 21:55:19 +02:00
Michał Hudobski
e2e479f20d auth: fix cdc vector search indexing permission bug
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table insted of the base.
This patch fixes that and adds a test that validates this behavior.

Fixes: VECTOR-476

Closes scylladb/scylladb#28050
2026-01-08 21:55:19 +02:00
Ernest Zaslavsky
19fe630c0e Update seastar submodule
seastar 4dcd4df..dd46b6fe
```
dd46b6fe net: expose DNS TTL via net::hostent
b94f81b0 test: Extend statat() test to check ENOENT exception reporting
```

Closes scylladb/scylladb#28006
2026-01-08 21:55:19 +02:00
Michael Litvak
8f15c7a874 db/view/view_update_generator: move discover_staging_sstables to start
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.

The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.

view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.

Fixes scylladb/scylladb#27956

Closes scylladb/scylladb#27970
2026-01-08 21:55:19 +02:00
Botond Dénes
8c72dcc1ec Merge 'database: truncate_table_on_all_shards: consider can_flush on all shards' from Benny Halevy
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).

Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

Closes scylladb/scylladb#27643

* github.com:scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-08 21:55:19 +02:00
Avi Kivity
633e6e0037 build: update toolchain generation procedure for optimized clang
Explain where to pick up existing clang archives, and how to upload new ones.

Closes scylladb/scylladb#27690
2026-01-08 21:55:18 +02:00
Evgeniy Naydanov
a9da14be19 test: dtest: reproducer for parallel rebuild failure
2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2.

Steps to reproduce:
    1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter.
    2. Let’s assume datacenter names are "dc1" and "dc2".
    3. Create a keyspace ("keyspace1") with RF=0 in dc2.
    4. Populate some data into dc1.
    5. Change keyspace1 replication in dc2 to 2.
    6. On 2 nodes in dc2 run the following command in parallel:
         nodetool rebuild --source-dc dc1

Parallel execution of rebuilds is not possible with RBNO enabled.

This test is the repro for #27804

Closes scylladb/scylladb#27747
2026-01-08 21:55:18 +02:00
Botond Dénes
9b4a7f1d14 Merge 'test: cluster: object_store: test_backup: modernize do_abort_restore' from Benny Halevy
Currently the function uses a regular expression
to check the system log for a specific message.
This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error:
```
test/cluster/object_store/test_backup.py:534
  /home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option.
    await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs])
```

Thsi change modernizes the implementation by:
- using auto_dc_rack for manager.servers_add
- using new_test_keyspace to generate and auto delete the keyspace
- using async gatherio and a prepared statement to insert the data
- simplifing the keys and values by NOT using os.urandom (that is notoriously slow)
- inserting fewer keys in debug mode
- removing the log check

With that, the test can be reenabled in all modes.

* No backport needed since the test was disabled

Closes scylladb/scylladb#27892

* github.com:scylladb/scylladb:
  test_backup: do_abort_restore: reduce data footprint
  test_backup: do_abort_restore: use error injection
  test_backup: do_abort_restore: use asyncio for cql
  test_backup: do_abort_restore: use new_test_keyspace
  test_backup: do_abort_restore: use logger rather than print
  test_backup: do_abort_restore: pass auto_rack_dc to servers_add
2026-01-08 21:55:18 +02:00
Anna Stuchlik
f614482e66 doc: add the patch release upgrade procedure for version 2025.4
Adds the patch upgrade guide based on previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/27982

Closes scylladb/scylladb#27985
2026-01-08 21:55:18 +02:00
Asias He
4f77dd058d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679
2026-01-08 21:55:18 +02:00
Anna Stuchlik
3f1c7c70f5 doc: remove the link to the Download Center
... from the OS support page.

Fixes https://github.com/scylladb/scylladb/issues/28047

Closes scylladb/scylladb#28048
2026-01-08 21:55:18 +02:00
Asias He
0aabf51380 repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was obseved:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.

To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.

A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.

Fixes #27666

Closes scylladb/scylladb#27870
2026-01-08 21:55:18 +02:00
Radosław Cybulski
5f48ab3875 storage_proxy: fix invalid assert
Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as
the latter was clearly meant.

Closes scylladb/scylladb#27900
2026-01-08 21:55:18 +02:00
Andrei Chekun
c950c2e582 test.py: convert skip_mode function to pytest.mark
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate

skip_mode to be dynamic pytest.mark that can be used as ordinary mark.

Closes scylladb/scylladb#27853

[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
2026-01-08 21:55:16 +02:00
Tomasz Grabiec
a52de4ecdc test: cluster: test_topology_ops[_encrypted]: Fix failures due to background migrations fencing out writes
The test if flaky, with failures in:

        for server in servers:
>           await check_node_log_for_failed_mutations(manager, server)

test/cluster/test_topology_ops_encrypted.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590>
server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578)

    async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo):
        logging.info(f"Checking that node {server} had no failed mutations")
        log = await manager.server_open_log(server.server_id)
        occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE|DEBUG|INFO)")
>       assert len(occurrences) == 0
E       AssertionError

test/cluster/util.py:319: AssertionError

As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625:

"The fencing errors here look legit given that we do not wait for all
requests to complete while shutting down the storage proxy. The
scenario is this:

Test does writes to rf=3 keyspace with cl=one. One node is shutting
down while there is a tablet migration. Tablet migration executes
barrier and drain which fails on a node that is been shutdown. The
topology coordinator proceeds fencing the old topology, but there
still can be un-handled mutation requests from the shutting down node
on other nodes and they will generate fencing errors like they should.

They way to avoid it (though it is benign) is to wait for all outgoing
storage proxy requests to complete during shutdown, but even then the
error may still happen since a request may timeout before it is
processed by the other side, so it may be completed by a storage proxy
coordinator side, but still not handled by replica side. This what we
have fencing for in the first place."

Fix by diabling background tablet migrations, so that we have no
topology barriers concurrent with node shutdown.

Fixes #27942

Closes scylladb/scylladb#28034
2026-01-08 21:53:47 +02:00
Tomasz Grabiec
34df158605 test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:

```
   @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')

        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})

        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]

        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')

        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)

        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')

        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)

        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})

        cql = manager.get_cql()

        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)

>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```

Fixes #28011

Closes scylladb/scylladb#28040
2026-01-08 21:53:47 +02:00
Andrei Chekun
ee0bf35615 test.py: add custome exit code for pytest in case maxfail reached
This PR adds custom exit code in case when maxfail reached. This is
needed for easier detection why pytest failed in CI.

Closes scylladb/scylladb#28018
2026-01-08 21:53:47 +02:00
Anna Stuchlik
1b653166f1 doc: add the version variable to the Web Installer instructions
This commit replaces a fixed version name with the variable for the current version
in the instructions for installing a non-default version with Web Installer.
This will make using the installer more user-friendly.

Fixes https://github.com/scylladb/scylladb/issues/28005
2026-01-08 10:12:21 +01:00
Benny Halevy
ebd667a8e0 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
93b827c185 database: truncate_table_on_all_shards: drop outdated TODO comment
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
2a803d2261 database: truncate_table_on_all_shards: consider can_flush on all shards
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.

Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().

Fixes #27639

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
02ee341a03 memtable_list: unify can_flush and may_flush
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:46 +02:00
Benny Halevy
0342a24ee0 test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:49:45 +02:00
Benny Halevy
5be6b80936 replica: table, storage_group, compaction_group: add needs_flush
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Benny Halevy
ec4069246d test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-08 09:41:22 +02:00
Patryk Jędrzejczak
946a2bb988 storage_service: do not call raft_topology_update_ip for left nodes
This `raft_topology_update_ip` call always returns after `t.find(raft_id)`
returns `nullptr`, so it effectively does nothing. It's not a bug, since
there is no reason to update `system.peers` for left nodes anyway. We
delete the rows corresponding to left nodes in `process_left_node` (called
just above).

Closes scylladb/scylladb#27899
2026-01-07 16:52:13 +01:00
Michał Jadwiszczak
be16e42cb0 service/storage_service: update service levels cache after upgrade to v2
Service levels cache is empty after upgrade to consistent topology
if no mutations are commited to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes SCYLLADB-90
2026-01-07 14:06:13 +01:00
Michał Jadwiszczak
53d0a2b5dc service/storage_service: check if service levels were already upgraded
before doing migration to raft

There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.
2026-01-07 14:06:13 +01:00
Nadav Har'El
f7eae50d98 test/alternator: fix some expected error messages to fit DynamoDB
All tests I am fixing in this patch do pass for me on DynamoDB, but
other developers report that they fail because some DynamoDB servers
apparently use slightly different error messages, with less detail about
the cause of an error. For example, some of our tests currently expect
an error message that looks like:

    An error occurred (ValidationException) when calling the Query
    operation: Invalid operator used in KeyConditionExpression:
    attribute_exists

But some servers don't report the ": attribute_exists" at the end, so
we can't use the word "attribute_exists" it in the test to recognize
the correct error, and needs to use a different word (which both
versions of DynamoDB and Alternator all print).

As another example, the good old DynamoDB error:

    An error occurred (ValidationException) when calling the Query
    operation: 1 validation error detected: Value 'DOG' at
   'conditionalOperator' failed to satisfy constraint: Member must
   satisfy enum value set: [OR, AND]

Got replaced by the following less informative message:

    An error occurred (ValidationException) when calling the Query
    operation: Failed to satisfy constraint: Member must satisfy enum
    value set: [ALL, OR]'

So we need to fix the test to allow it too.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 14:06:33 +02:00
Nadav Har'El
e97fbc2d65 test/alternator: fix compressed request test on non-us-east1
The test test_compressed_request.py::test_compressed_request coerces
boto3 to send a compressed request, and wrongly used region_name=us-east-1
to set up the connection. Theoretically, this doesn't matter because
we also set the correct URL (for either Alternator or the desired region
in AWS). But in fact it does matter, because region name is part of the
request's signature, and DynamoDB refuses the request if it comes to
a different region than it is signed for. So this test fails when run
on DynamoDB on any other region except us-east-1.

The fix is simple - don't use the constant "us-east-1", but pick up the
correct region name from the original connection.

The functions new_dynamodb_session(), new_dynamodb() and
new_dynamodb_stremas() had the same bug and we fix it too, but it didn't
break any test because the only tests using these functions were
Scylla-only so the AWS region problem didn't apply to them.
2026-01-07 13:33:46 +02:00
Patryk Jędrzejczak
f0d159abb0 Merge 'test/raft: use valid sentinel in liveness check to prevent digest errors' from Emil Maskovsky
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.

The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).

Closes scylladb/scylladb#28010

* https://github.com/scylladb/scylladb:
  test/raft: use valid sentinel in liveness check to prevent digest errors
  test/raft: improve debugging in randomized_nemesis_test
2026-01-07 12:31:21 +01:00
Nadav Har'El
2c02e463ff test/alternator: fix test's expected error message on DynamoDB
The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an
error - it's not allowed to set a tag on a GSI or LSI - but the error
message that DynamoDB prints recently changed - instead of saying
"ResourceArn" the new error message says "resource arn".

Change the test to allow both forms, so it will pass on both Alternator
(which still uses the word ResourceArn - which is the name of the
parameter) and on DynamoDB (which uses "resource arn").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
4f3150c282 test/alternator: mark Alternator-only test scylla_only
The test test_batch.py::test_batch_write_item_large_broken_connection
failed on DynamoDB (Refs #26079). It turns out this test has many
problems:

1. This test wrongly assumes a batch write needs to complete in one
   attempt - and this fails on DynamoDB with low WCU capacity where
   the batch needs to be resumed in multiple requests. Using boto3's
   batch_writer() fixes this problem.
2. This test has NOTHING to do with batches - so is mis-named and
   mis-placed. The batch write is just a way to prepare some data
   in the table, and the real test is about Query'ing the data back
   and observing the long response and reproducing issue #14454.
   I did not rename or move the test, but left a comment explaining
   the situation.
3. This test is written to assume the Query's response uses HTTP
   chunked encoding. Which isn't actually true for DynamoDB, at least
   not at the time of this writing. So the test fails on DynamoDB.

For the last reason, I made this test scylla_only. This test can't
really be run on DynamoDB without rewriting it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
df6b347911 test/alternator: fix test on DynamoDB
The test test_batch.py::test_batch_write_item_large often fails when
running on DynamoDB, and this patch fixes it. The test checks that a
large but not over-the-limits large batch works. However, "works" only
means that the batch is not an error - it doesn't guarantee that all the
items in the batch are performed. If the WCU limits of the table are
exceeded DynamoDB may perform only part of the the batch and return the
remaining items as UnprocessedItems. This not only can happen, it
usually does happen on DynamoDB - because a new on-demand-billing table
always start with a very low WCU capacity.

So in this patch we update the test to recognize and perform the
UnprocessedItems, instead of assuming it needs to be empty.

The test continues to pass on Alternator, and finally passes on
DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:51:10 +02:00
Nadav Har'El
9d6a463324 test/alternator: increase wait_for_gsi() timeout
In Alternator tests, the wait_for_gsi() utility function is used in
tests that add a GSI to an existing table, to wait for this new GSI
to become ready. Although this takes a fraction of a second on
Alternator, we noticed that this takes many minutes (!) on DynamoDB
so we used an absurdly high 10 minute timeout to allow tests to also
pass on DynamoDB.

But it turns out that 10 minutes wasn't absurdly high enough, and
tests using it in test_gsi_updatetable.py started to fail on DynamoDB.
Empirically, 10 minutes was enough in the past but it seems that today
adding a GSI to an empty table routinely takes as much as 20 minutes.

So this patch increases the wait_for_gsi() timeout to a whopping 30
minutes. After this patch, the tests in test_gsi_updatetable.py which
used to fail - test_gsi_backfill_with_lsi,
test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and
test_gsi_backfill_oversized_key now all pass on DynamoDB - but each
takes more than 20 minutes to pass.

To allow the test to fail much more quickly on Alternator (where
creating a GSI takes a fraction of a second), we set a much lower
but still very high timeout when running on Alternator - 60 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-07 12:50:54 +02:00
Łukasz Paszkowski
62313a6264 load_sketch: Allow populating load_sketch with normalized current load
Currently, tablet allocation intentionally ignores current load (
introduced by the commit #1e407ab) which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.

This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.

Fixes https://github.com/scylladb/scylladb/issues/27620

Closes scylladb/scylladb#27802
2026-01-07 11:49:01 +01:00
Avi Kivity
2642636ada build: avoid ccache masquarading when choosing ccache too
In 12dcf79c60, we avoid the ccache masquarate directory
when choosing sccache, as that would give us a double-caching
effect: first sccache is called, then clang++ is looked up
finding ccache masquarading as clang++. We solved that by
converting the name clang++ to the absolute path /usr/bin/clang++
(or whatever), skipping over the masquarade directory in $PATH.

It turns out that we need to do the same for ccache. That commit
changed the compile command to 'ccache clang++', and ccache will
look up clang++ in $PATH, finding itself in the masquarade directory.

Fix that by avoiding the masquarade directory if a compiler cache is
specified explicitly or is found with --compiler-cache=auto.

Closes scylladb/scylladb#27996
2026-01-06 17:47:09 +02:00
Nadav Har'El
5f79d93102 Merge 'Alternator response compression' from Szymon Malewski
This pull request introduces HTTP response compression to Alternator, allowing responses (both string and chunked) to be compressed using `gzip` or `deflate` when requested by clients and when the response size exceeds configurable thresholds.

* Added new source files `http_compression.cc` and `http_compression.hh` implementing compression logic, including parsing client `Accept-Encoding` headers, selecting compression algorithms, and compressing response bodies using zlib.

* Added two new configuration options to `db::config` (`alternator_response_gzip_compression_level` and `alternator_response_gzip_compression_threshold_in_bytes`) to control compression level (and optionally disable compression with level 0 - no compression) and minimum response size for compression.

* Added tests showing compliance with DynamoDB behavior.

Fixes #27246

New feature - no backporting

Closes scylladb/scylladb#27454

* github.com:scylladb/scylladb:
  alternator/http_compression: Add compression of streamed response
  alternator/http_compression: Add implementation od gzip/deflate of string response
  alternator/http_compression: Add handling of Accept-Encoding header
  test/alternator: add tests for compressed responses
2026-01-06 16:47:11 +02:00
Emil Maskovsky
4ba3e90f33 test/raft: use valid sentinel in liveness check to prevent digest errors
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.

The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307
2026-01-06 14:34:02 +01:00
Emil Maskovsky
3af5183633 test/raft: improve debugging in randomized_nemesis_test
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.

Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.
2026-01-06 14:32:46 +01:00
Ferenc Szili
a51cb3dad9 test: fix flaky test_update_load_stats_after_migration
Disable load balancing to avoid the balancer moving the tablet from a
node with less to a node with more available disk space. Otherwise, the
move_tablet API can fail (if the tablet is already in transisiton) or
be a no-op (in case the tablet has already been migrated)

Fixes: #27980

Closes scylladb/scylladb#27993
2026-01-06 11:57:35 +02:00
Andrei Chekun
b546315edf test.py: fix race condition in initizlization of cqlpy tests
Fix the race condition when the process finished, while test is trying
to checks its descriptors. Now instead of failing the whole loop, it
will continue to iterate the rest of the process to find the needed
process.

Closes scylladb/scylladb#27994
2026-01-06 10:40:25 +02:00
Avi Kivity
4c9c3aae23 tools: toolchain: add dockerfile for future toolchain
To avoid surprises when libstdc++, clang, or other components
in the toolchain introduce regressions, we introduce a "future
toolchain". This builds on the Fedora version under active
development, and the development branches of gcc and llvm.

The future toolchain is not intended to be frozen. Rather,
periodically we will build the future toolchain, then build
ScyllaDB and run its unit tests under that toolchain, then
discard it. Any problems will then have be be tracked down
by a developer and either reported to the source repository,
or fixed in ScyllaDB.

Closes scylladb/scylladb#27964
2026-01-05 19:38:58 +02:00
Nadav Har'El
384e394ff0 Merge 'Add similarity functions to calculate similarity of given vectors' from Dawid Pawlik
It should be possible to return the similarity of vectors in CQL statements following the [Cassandra compatible syntax](https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html#query-vector-data-with-cql):

```
SELECT comment, similarity_cosine(comment_vector, [0.1, 0.15, 0.3, 0.12, 0.05])
    FROM cycling.comments_vs;
```

Although the calculations are slow, and we already have calculated results returned via Vector Store API,
we need the functionality as it allows us to calculate similarity of vectors not stored in vector indexes.

It will be needed for [quantization and rescoring](https://scylladb.atlassian.net/wiki/spaces/RND/pages/195985800/Quantization+and+Rescoring).

The feature is also a nice-to-have in testing as requested many times by testing and CX teams.

The optimized version utilizing already calculated distances from Vector Store without a need of rescoring will be coming soon after via https://github.com/scylladb/scylladb/pull/27991.

---

The patch adds functions:
- `similarity_cosine(<vector>, <vector>)`,
- `similarity_euclidean(<vector>, <vector>)`,
- `similarity_dot_product(<vector>, <vector>)`

Where `<vector>` is either a column of type `VECTOR<FLOAT, N>` or a vector of floats literal.

These functions can be called with every `SELECT` query, not only ANN vector queries as opposed to https://github.com/scylladb/scylladb/pull/25993.

The similarity calculations are implemented inspired by [USearch's implementation](
a2f1759910/include/usearch/index_plugins.hpp (L1304-L1385)) and made compatible with [Cassandra's documentation](https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/functions.html#vector-similarity-functions).
That would guarantee the results in ScyllaDB are calculated using the exact same algorithms as used in Vector Store indexes.

---

Fixes: SCYLLADB-88
Fixes: SCYLLADB-89

New feature, should land into 2026.1

Closes scylladb/scylladb#27524

* github.com:scylladb/scylladb:
  docs: add vector similarity functions documentation
  test/cqlpy: add similarity functions correctness tests
  test/cqlpy: add similarity functions invalid call tests
  cql3: introduce similarity functions syntax
  vector_similarity_fcts: introduce similarity functions
  vector_similarity_fcts: retrieve similarity function argument types
  vector_similarity_fcts: add calculating similarity between vectors
2026-01-05 18:28:10 +02:00
Tomasz Grabiec
ffa11d6a2d test: Verify that repair doesn't block disabling of tablet load balancing
Refs #27647
2026-01-05 13:22:15 +01:00
Tomasz Grabiec
ccdb301731 tablets: Make balancing disabling call preempt tablet transitions
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.

New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unuaffected by the flag,
like repair, will be scheduled.

Fixes #27647
2026-01-05 13:22:08 +01:00
Nadav Har'El
5c2ca56adf test/alternator: fix test passing a spurious parameter
The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi
failed on DynamoDB (Refs #26079) because we passed an unused parameter
NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is
only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL.
DynamoDB refuses to create an LSI with such inconsistent parameters,
and we just need to remove this unnecessary parameter from this test.

The reason why this test didn't fail on Alternator is that Alternator
doesn't yet support or even parse the Projection parameter (Refs #5036).

We also add an xfailing test (passes on DynamoDB, fails on Alternator)
checking that a spurious NonKeyAttributes parameter is rejected. When
we get around to implement the projection feature (#5036), this will
be yet another acceptance test for this feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-05 13:51:01 +02:00
Botond Dénes
e4da0afb8d reader_concurrency_semaphore: add protection against negative count resource leaks
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.

I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.

Fixes: SCYLLADB-163

Closes scylladb/scylladb#27764
2026-01-05 12:45:15 +02:00
Anna Stuchlik
375479d96c doc: fix the syntax of internal links
Some internal links had the wrong syntax: they were formatted as external links.
As a result, they redirected the user to the outdated Open Source documentation.
This commit fixes that bug.

Fixes https://github.com/scylladb/scylladb/issues/25899

Closes scylladb/scylladb#27905
2026-01-05 10:44:58 +01:00
Szymon Malewski
1f658bb2e2 alternator/http_compression: Add compression of streamed response
This patch adds compression of chunked responses.
It adds intermediate stream to compress chunks of data that are provided to http sink.

Fixes #27246
2026-01-05 10:14:42 +01:00
Szymon Malewski
b8afb173a6 alternator/http_compression: Add implementation od gzip/deflate of string response
Previous commit added means to decide whether client asks for compression and with which algorithm.
This patch adds actual compression of responses based on zlib library.
For now only string (not chunked) responses are compressed.
Several previously defined tests start to pass.
2026-01-05 10:14:42 +01:00
Szymon Malewski
ec329f85b0 alternator/http_compression: Add handling of Accept-Encoding header
This is an initial patch to add support of Alternator's compressed responses.
The actual compression (gzip,deflate) will be added in the following commits.
The main functionality added in this commmit is parsing of Accept-Encoding header,
that indicates compression algorithms supported by the client.
In this commit we add also configuration parameters of response gzip/deflate compression.
They allow to enable/disable compression, set level and a size threshold below which a response is not compressed.
With current implementation it is possible to decide a compression for each response, but it is not used yet.
2026-01-05 10:14:40 +01:00
Szymon Malewski
08386ea959 test/alternator: add tests for compressed responses
Adds set of tests that:
1. Show how DynamoDB handles response compression.
It supports 'gzip' and 'deflate' compression, which can be selected by providing 'Accept-Encoding` header. It only encodes response above 4096B.
- `test_compressed_response`, `test_compressed_response_large` show compression for various response sizes.
- `test_accept_encoding_header` focuses on testing various values of Accept-Encoding header.
- `test_multiple_accept_encoding_headers` verifies behaviour with repeted Accept-Encoding headers.

2. Will confirm implementation of response compression in Alternator (#27246)
Additonally to above test, we check Altenator specific expectations:
- `test_chunked_response_compression` makes sure that compression will work also for chunked responses.
- `test_set_compression_options` checks config options to set response size threshold for compression and compression level

3. `test_signature_trims_accept_encoding_spaces` reveals Alternator's bug in signature verification (#27775)
2026-01-05 10:13:40 +01:00
Avi Kivity
0df85c8ae8 Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov"
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.

Fixes #27966

Closes scylladb/scylladb#27969
2026-01-05 08:53:41 +02:00
Benny Halevy
a8114f9bcc test_backup: do_abort_restore: reduce data footprint
To make the test fast, in particular in debug mode
insert fewer keys and do not rely on os.urandom
which is notoriously slow

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:04:08 +02:00
Benny Halevy
c0dd662144 test_backup: do_abort_restore: use error injection
Currently the test depends on timing and enough inserted
data to abort the restore tasks at exactly the right time.
This is flaky in nature, so instead, use error injection
to synchronize the abort with mutation streaming.

Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS
log message, so waiting for it is dropped from the test.
The most imporant thing is that some restore tasks must fail.
(We cannot guarantee all would fail unfortunately)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
16dd07c7d4 test_backup: do_abort_restore: use asyncio for cql
Use the more modern asyncio facility to run cql queries
and a prepared statement to insert data into the table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
f1a583c39c test_backup: do_abort_restore: use new_test_keyspace
For creating a keyspace with a unique name and auto-deleting
it on exit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
3e8431a3d9 test_backup: do_abort_restore: use logger rather than print
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Benny Halevy
acb2c9b045 test_backup: do_abort_restore: pass auto_rack_dc to servers_add
To generate multi-rack cluster, otherwise we get the following error:
```
E   cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1">
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-05 08:03:53 +02:00
Dani Tweig
1ef6ac5439 consolidating jira automation to one workflow file
Closes scylladb/scylladb#27854
2026-01-05 07:09:03 +02:00
copilot-swe-agent[bot]
4e41b6f106 tools/scylla-nodetool: Increase precision of compression ratio from 1 to 2 decimal places
In the tablestats (cfstats) command.
Fixes: https://github.com/scylladb/scylladb/issues/27962

Closes scylladb/scylladb#27965
2026-01-05 07:07:06 +02:00
Avi Kivity
e03d24e3f3 Merge 'Use file_stat with a relative path when listing directories' from Benny Halevy
With the additional file_stat overload introduced in
[Update seastar submodule](3e9b071838),
use the opened directory for more efficient, relative-path based stat.

* Enhancement, no backport needed

Closes scylladb/scylladb#27967

* github.com:scylladb/scylladb:
  table: get_snapshot_details: use relative-path based file_stat
  table: get_snapshot_details: fix warning in exists_in_dir
  table: get_snapshot_details: fix staging dir calculation
  backup: process_snapshot_dir: use relative-path based file_stat
  directory_lister: add ctor with opened directory
2026-01-04 22:06:34 +02:00
Nadav Har'El
c4a9d7eb3e cql: fix DESC KEYSPACES when a "USE" is in effect
If a CQL session USEs a keyspace and then calls DESC TABLES, the user
expects to see only the tables in the chosen keyspace. However, calling
DESC KEYSPACES should still return list all the keyspaces - returning
just the USEd one is not useful - and also not what Cassandra does.
We had an xfailing test test_describe.py::test_keyspaces_with_use which
reproduces this bug (and passes on Cassandra).

In this patch we fix this bug. The fix is simple - USE should affect
DESC statements, but be ignored for DESC KEYSPACES. We can then remove
the xfail marker from the test.

The patch also includes a new test for the DESC TABLES case, where the
USE *does* have an affect. And I wanted to make sure the patch doesn't
break this case. As usual, the new test passes on both Cassandra and
ScyllaDB.

Fixes #26334

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27971
2026-01-04 22:01:12 +02:00
Dawid Mędrek
77a934e5b9 db/hints: Prevent draining hints before hint replay is allowed
Context
-------
The procedure of hint draining boils down to the following steps:

1. Drain a hint sender. That should get rid of all hints stored
   for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.

Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.

Breaking it down, it comprises of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will not be performed
if any of the following bullets is not satisfied:

* `hint_sender::replay_allowed()` is not `true`. This can happen when
  hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is not `true`. This can happen if the
  corresponding endpoint is not alive AND it hasn't left the cluster
  AND it's still a normal token owner.

There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.

---

Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:

```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```

The folder with the remaining hints will also stay on disk, which is, of
course, undesired.

---

When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in a commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.

Dangerous situations:

1. We try to drain hints before hint replay is allowed. That will
   violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
   and it still possesses some tokens.

---

How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).

The other bullet is more tricky. It's important here to know that
draining only initiated in three situations:

1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.

The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.

The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:

> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".

I suggest reading the full commit message there because the problems
there are somewhat similar these changes try to solve.

Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:

> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.

This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.

Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.

When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.

---

Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.

The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.

Fixes scylladb/scylladb#27693

Closes scylladb/scylladb#27713
2026-01-04 16:54:05 +02:00
Benny Halevy
4d46674d03 table: get_snapshot_details: use relative-path based file_stat
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
2d2177d2c9 table: get_snapshot_details: fix warning in exists_in_dir
The functor is called both on the data directory as well
as on the staging directory, so the warning printed if the
found file is not the same inode should print the given path,
not datadir / name (as was copy and pasted).

Refs #27635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
240b32a87a table: get_snapshot_details: fix staging dir calculation
staging is based off of datadir, not snapshot_dir.

the issue was introduced in f5ca3657e2.

Refs #27635

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
1a08ef2062 backup: process_snapshot_dir: use relative-path based file_stat
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-04 11:05:56 +02:00
Benny Halevy
8d00266f88 directory_lister: add ctor with opened directory
This ctor allows the caller to open the directory first,
on its own, and pass it down to the directory_lister.

Once all callers use this ctor we can get rid of
the delayed open in the get() method.

Also, in can be used to replace full-path based file_stat calls
on listed entries with file_stat(directory, name) calls
that are based on statat() and a relative path name that is present
in the listed directory entry.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

sq
2026-01-04 11:05:18 +02:00
Amnon Heiman
c6d1c63ddb distributed_loader: system_replicated_keys as system keyspace
This patch adds system_replicated_keys to the list of known system
keyspaces.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:41:47 +02:00
Amnon Heiman
83c1103917 replicated_key_provider: make KSNAME public
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.

It will be used to identify it as a system keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:39:51 +02:00
Dawid Pawlik
c0b06a7fc6 docs: add vector similarity functions documentation
Add documentation in `functions.rst` as the CQL reference
for a vector similarity functions.
This includes the syntax, example usage, and prerequisites
for the parameters.
2026-01-02 13:02:59 +01:00
Dawid Pawlik
115bd51873 test/cqlpy: add similarity functions correctness tests
Add `calculate_similarity` function for testing purposes.

Add tests checking if CQL returned values match the calculated
ones with the precision up to 5th decimal place.

The tests should also be run on Cassandra to check compatibility
with their responses.
2026-01-02 13:02:59 +01:00
Dawid Pawlik
12aa33106f test/cqlpy: add similarity functions invalid call tests
Add tests checking that calling similarity functions with:
- non-vector columns
- non-vector values
- vectors with mismatching dimensions
as arguments fails.
2026-01-02 12:49:22 +01:00
Dawid Pawlik
b03d520aff cql3: introduce similarity functions syntax
The similarity function syntax is:

`similarity_<metric_name>(<vector>, <vector>)`

Where `<metric_name>` is one of `cosine`, `euclidean` and `dot_product`
matching the intended similarity metric to be used within calculations.
Where `<vector>` is either a vector column name or vector literal.

Add `vectorSimilarityArgs` symbol that is an extension of `selectionFunctionArgs`,
but allowing to use the `value` as an argument as well as the `unaliasedSelector`.
This is needed as the similarity function syntax allows both the arguments to be
a vector value, so the grammar needs to recognize the vector literal there as well.

Since we actually support `SELECT`s with constants since this patch,
return true instead of throwing an error while trying to convert the function call
to constant.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
5b2b8d596a vector_similarity_fcts: introduce similarity functions
This patch introduces scalar functions `similarity_cosine()`,
`similarity_euclidean()`, and `similarity_dot_product()`
which should return a float - similarity of the given vectors
calculated according to the function's similarity metric.

The argument types of this function are retrieved with
the `retrieve_vector_arg_types`, but shall be assignable to
`vector<float, N>` where `N` is the same for both arguments.

This patch introduces a dimensionality check during the execusion
of those functions.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
b72df3ae27 vector_similarity_fcts: retrieve similarity function argument types
This patch retrieves the argument types for similarity functions.
Newly introduced `retrieve_vector_arg_types` function checks if
the provided arguments are vectors of floats and if
both the vector values match the same type (dimension).
If so, we know the exact type and set it as the function arguments type.
Otherwise, if the exact type is unkown, but we can assign to vector<float, N>
then the dimensionality check will be done during execution of
the similarity function.
This also takes care of null values and bind variables the same way
as implemented in Cassandra to stay compatible.
Meaning that if we can infer the type from one argument, then the latter
may be unknown (null or ?).

Additionally this patch adds `test_assignment_any_vector` function
which tests the weak assignment to vector<float, N> as mentioned
above.
2026-01-02 12:48:43 +01:00
Dawid Pawlik
2bedefbb85 vector_similarity_fcts: add calculating similarity between vectors
This commit introduces `compute_cosine_similarity`, `compute_euclidean_similarity`,
`compute_dot_product_similarity` functions to calculate the vectors similarity
in respective metric.
The similarity is a float value meaning how similar the vectors are in a range of [0, 1].
Values closer to 1 indicate greater similarity.

The `dot_product` similarity requires L2 normalized vectors as arguments.
The similarity is calculated based on the jVector's implementation used by Cassandra.
f967f1c924/jvector-base/src/main/java/io/github/jbellis/jvector/vector/VectorSimilarityFunction.java (L36-L69)
2026-01-02 12:48:08 +01:00
Nadav Har'El
6c8ddfc018 test/alternator: fix typo in test_returnvalues.py
Different DynamoDB operations have different settings allowed for
their "ReturnValues" argument. In particular, some operations allow
ReturnValues=UPDATED_OLD but the DeleteItem operation *does not*.

We have a test, test_delete_item_returnvalues, aimed to verify this
but it had a typo and didn't actually check "UPDATED_OLD". This patch
fixes this typo.

The test still passes because the code itself (executor.cc,
delete_item_operation's constructor) has the correct check - it was
just the test that was wrong.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27918
2026-01-01 19:33:23 +02:00
Israel Fruchter
40ada3f187 Update tools/cqlsh submodule (v6.0.32)
* tools/cqlsh scylladb/scylla-cqlsh@9e5a91d7...scylladb/scylla-cqlsh@5a1d7842 (9):
  > fix wrong reference in copyutil.py
  > Add GitHub Action workflow to create releases on new tags
  > test_copyutil.py: introdcue test for ImportTask
  > fix(copyutil.py): avoid situatuions file might be move withing multiple processes
  > Fix Unix socket port display in show_host() method
  > Merge pull request #157 from scylladb/alert-autofix-1
    .github/workflows/build-push.yml: Potential fix for code scanning alert no. 1: Workflow does not contain permissions
  > .github/workflows/dockerhub-description.yml: Potential fix for code scanning alert no. 9: Workflow does not contain permissions
  > test_cqlsh_output: skip some cassandra 5.0 table options
  > tests: template compression cql to use `class` insted of `sstable_comprission`
  > Pin Cassandra version to 5.0 for reproducible builds
  > Remove scylla-enterprise integration test and update Cassandra to latest

Closes scylladb/scylladb#27924
2026-01-01 19:30:34 +02:00
Łukasz Paszkowski
76b84b71d1 storage/test_out_of_space_prevention.py: Fix async/await bugs
- Add missing await keywords for async operations on s2_log.wait_for()
  and coord_log.wait_for()
- Fix incorrect regex: "compaction .* Split {cf}" → "compaction.*Split {cf}"
- The commit https://github.com/scylladb/scylladb/commit/f7324a4 demoted
  compaction start/end log messages to debug level. Hence add
  compaction=debug log messages to the following tests:
    test_split_compaction_not_triggered
    test_node_restart_while_tablet_split
    test_repair_failure_on_split_rejection

Fixes https://github.com/scylladb/scylladb/issues/27931

Closes scylladb/scylladb#27932
2026-01-01 14:24:30 +02:00
Anna Stuchlik
624869de86 doc: remove cassandra-stress from installation instructions
The cassandra-stress tool is no longer part of the default package
and cannot be run in the way described.

This commit removes the instruction to run cassandra-stress.

Fixes https://github.com/scylladb/scylladb/issues/24994

Closes scylladb/scylladb#27726
2026-01-01 14:20:58 +02:00
Jenkins Promoter
69d6e63a58 Update pgo profiles - aarch64 2026-01-01 05:10:51 +02:00
Jenkins Promoter
d6e2d3d34c Update pgo profiles - x86_64 2026-01-01 04:27:14 +02:00
Nadav Har'El
e28df9b3d0 test: fix Python warnings in regular expressions
Like C, Python supports some escape sequences in strings such as the
familiar "\n" that converts to a newline character.
Originally, when backslash was used before a random character, for
example, "\.", Python used to just use these literal characters
backslash and dot, in the string - and not make a fuss about it.
This made it ok to use a string like "hi\.there" as a regular expression.
We have a few instances of this in our Python tests.

But recent releases of Python started to produce ugly warnings about
these cases. The error message looks like:

    SyntaxWarning: "\." is an invalid escape sequence. Such sequences
    will not work in the future. Did you mean "\\."? A raw string is
    also an option.

Indeed in most cases the easiest solution is to use a "raw string",
a string literal preceded with r. For example, r"hi\.there". In such
strings Python doesn't replace escape sequences like \n in the string,
and also leaves the \. unchanged for the regular expression to see.

So in this patch we use raw strings in all places in test/ where Python
warns have this problem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27856
2025-12-31 20:44:01 +02:00
Yaniv Michael Kaul
597d300527 main.cc: remove warning: 'metric_help' is deprecated
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Backport: no, benign issue.

Closes scylladb/scylladb#27680
2025-12-31 18:36:55 +02:00
Avi Kivity
b690ddb9e5 tools: toolchain: dbuild: bind-mount full ~/.cache to container
In afb96b6387, we added support for sccache. As a side effect
it changed the method of invoking ccache from transparent via PATH
(if it contains /usr/lib64/ccache) to explicit, by changing the compiler
command line from 'clang++' (which may or may not resolve the the ccache
binary) to 'ccache /usr/local/bin/clang++', which always invokes ccache.

In the default dbuild configuration, PATH does not contain /usr/lib64/ccache,
so ccache isn't invoked by default. Users can change this via the
SCYLLADB_DBUILD environment variable.

As a result of ccache being suddenly enabled for dbuild builds, ccache
will now attempt to create ~/.cache/ccache. Under docker, this does
not work, because we bind-mount ~/.cache/dbuild. Docker will create the
intermediate ~/.cache, but under the root user, not $USER. The intermediate
directory being root-owned prevents ~/.cache/ccache from being created.

Under podman, this does work, because everything runs under the container's
root user.

The fix is to bind-mount the entire ~/.ccache into the container. This
not only lets ccache create the directory, it will also find an existing
~/.cache/ccache directory and use it, enabling reuse across invocations.

Since ccache will now respect configuration changes without access to
its configuration file (notably, the maximum cache size), we also
bind-mount ~/.config.

Since ~/.ccache and ~/.config are not automatically created, we create
them explicitly so the bind mounts can work. This is for new nodes enlisted
from the cloud; developer machines will have those directories preexisting.

Note that the ccache directory used to be ~/.ccache, but was later changed.
Had the author known, we would have bind-mounted ~/.cache much earlier.

Fixes #27919.

Closes scylladb/scylladb#27920
2025-12-31 14:08:41 +01:00
Asias He
3abda7d15e topology_coordinator: Ensure repair_update_compaction_ctrl is executed
Consider this:

- n1 is a coordinator and schedules tablet repair
- n1 detects tablet repair failed, so it schedules tablet transition to end_repair state
- n1 loses leadership and n2 becomes the new topology coordinator
- n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000
- when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step

To fix, we use the global_tablet_id to index the lock instead of the
session id.

In addition, we retry the repair_update_compaction_ctrl verb in case of
error to ensure the verb is eventually executed. The verb handler is
also updated to check if it is still in end_repair stage.

Fixes #26346

Closes scylladb/scylladb#27740
2025-12-31 13:17:18 +01:00
Benny Halevy
3e9b071838 Update seastar submodule
* seastar f0298e40...4dcd4df5 (29):
  > file: provide a default implementation for file_impl::statat
  > util: Genralize memory_data_sink
  > defer: Replace static_assert() with concept
  > treewide: drop the support of fmtlib < 9.0.0
  > test: Improve resilience of netsed scheduling fairness test
  > Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai
    file: Put alignment helpers in anonymous namespace
    file: Use query_device_alignment_info in blkdev_alignments
  > Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai
    file: Apply physical_block_size override to filesystem files
    file: Use designated initializers in xfs_alignments
    iotune: Add physical block size detection
    disk_params: Add support for physical_block_size overrides from io_properties.yaml
    block_device: Query alignment requirements separately for memory and I/O
  > Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy
    json: formatter: fix formatting of std:string_view
    json: formatter: make sure std::string_view conforms to is_string_like

Fixes #27887

  > demos:improve the output of demo_with_io_intent() in file_demo
  > test: Add accept() vs accept_abort() socket test
  > file: Refine posix_file_impl alignments initialization
  > Add file::statat and a corresponding file_stat overload
  > cmake: don't compile memcached app for API < 9
  > Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs
    future: adjust lifetime for lvalue continuations
    future: fix value class operator()
  > pollable_fd: Unfriend everything
  > Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy
    file: experimental_list_directory: use buffered generator
    file: define list_directory_generator_type
  > Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov
    net: Deprecate datagram::get_data() returning packet
    memcache: Fix indentation after previous patch
    memcache: Use new datagram::get_buffers() API
    dns: Use new datagram::get_buffers() API
    tests: Use new datagram::get_buffers() API
    demo: Use new datagram::get_buffers() API
    udp: Make datagram implementations return span of temporary_buffer-s
  > Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov
    reactor: Fix indentation after previous patch
    timers: Remove enabling callback from timer_set::complete()
  > treewide: avoid 'static sstring' in favor of 'constexpr string_view'
  > resource: Hide hwloc from public interface
  > Merge 'Fix handle_exception_type for lvalues' from Travis Downs
    futures_test: compile-time tests
    function_traits: handle reference_wrapper
  > posix_data_sink_impl: Assert to guard put UB
  > treewide: fix build with `SEASTAR_SSTRING` undefined
  > avoid deprecation warnings for json_exception
  > `util/variant_utils`: correct type deduction for `seastar::visit`
  > net/dns: fixed socket concurrent access
  > treewide: add missing headers
  > Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov
    file: Remove file_read_state
    test: Add a test for posix_file_impl::do_dma_read_bulk()
  > membarrier: simplify locking

Adjust scylla to the following changes in scylla:
- file_stat became polymorphic
  - needs explicit inference in table::snapshot_exists, table::get_snapshot_details
- file::experimental_list_directory now returns list_directory_generator_type

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27916
2025-12-30 19:37:13 +03:00
Yaniv Kaul
0264ec3c1d test: test_downgrade_after_partial_upgrade: check that feature is disabled on all nodes after partial upgrade
We should check that the test feature is disabled on all nodes after a partial
upgrade. This hardens the test a bit, although the old code wasn't that bad,
since enabled features are a part of the group 0 state shared by all nodes.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27654
2025-12-30 17:34:56 +01:00
Nadav Har'El
ffcce1ffc8 test/boost: fix flaky test node_view_update_backlog
The boost test view_schema_test.cc::node_view_update_backlog can be
flaky if the test machine has a hiccup of 100ms, and this patch fixes
it:

The test is a unit test for db::view::node_update_backlog, which is
supposed to cache the backlog calculation for a given interval. The
test asks to cache the backlog for 100ms, and then without sleeping
at all tries to fetch a value again and expect the unchanged cached
value to be returned. However, if the test run experiences a context
switch of 100ms, it can fail, and it did once as reported in #27876.

The fix is to change the interval in this test from 100ms to something
much larger, like 10 seconds. We don't sleep this amount - we just need
the second fetch to happen *before* 10 seconds has passed, so there's
no harm in using a very large interval.

However, the second half of this test wants to check that after the
interval is over, we do get a new backlog calculation. So for the
second half of this test we can and should use a shorter backlog -
e.g., 10ms. We don't care if the test machine is slow or context switched,
for this half of the test we want to to sleep *more* than 10ms, and
that's easy.

The fixed test is faster than the old one (10ms instead of 100ms) and
more reliable on a shared test machine.

Fixes #27876.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27878
2025-12-30 10:10:42 +01:00
Benny Halevy
c9eab7fbd4 test: test_refresh: add test_refresh_deletes_uploaded_sstables
The refresh api is expected to automatically delete
the sstable files from the uploads/ dir.  Verify that.

The code that does that is currently called by
sstables_loader::load_new_sstables:
```c++
        if (load_and_stream) {
...
                co_await loader.load_and_stream(ks_name, cf_name, table_id, std::move(sstables_on_shards[this_shard_id()]), primary_replica_only(primary), true /* unlink */, scope, {});
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27586
2025-12-30 10:51:24 +03:00
Nadav Har'El
80e5860a8c docs/alternator: document that Streams needs vnodes
The current state (after PR #26836) is that Alternator tables are
created by default using tablets. But due to issue #23838, Alternator
Streams cannot be enabled on a table that uses tablets... An attempt to
enable Streams on such a table results in a clear error:

    "Streams not yet supported on a table using tablets (issue #23838).
    If you want to use streams, create a table with vnodes by setting
    the tag 'system:initial_tablets' set to 'none'."

But users should be able to learn this fact from the documentation -
not just retroactively from an error message. This is especially important
because a user might create and fill a table using tablets, and only get
this error when attempting to enable Streams on the existing table -
when it is too late to change anything.

So this patch adds a paragraph on this to compatibility.md, where
several other requirements of Alternator Streams are already mentioned.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27000
2025-12-30 10:45:34 +03:00
Avi Kivity
853f3dadda Merge 'treewide: fix some spelling errors' from Piotr Smaron
Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all.

No need to backport, just cosmetic changes.

Closes scylladb/scylladb#27897

* github.com:scylladb/scylladb:
  treewide: fix some spelling errors
  codespell: ignore `iif` and `tread`
2025-12-29 20:45:31 +02:00
Patryk Jędrzejczak
0fed9f94f8 gossiper: add_saved_endpoint: make generations of excluded nodes negative
The explanation is in the new comment in `gossiper::add_saved_endpoint`.

We add a test for this change. It's "extremely white-box", but it's better
than nothing.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
749b0278e5 test: introduce test_full_shutdown_during_replace 2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
4526dd93b1 utils: error_injection: allow aborting wait_for_message
The test added in the following commit utilizes it.
2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak
fc4c2df2ce raft topology: preserve IP -> ID mapping of a replacing node on restart
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
2025-12-29 19:13:53 +01:00
Patryk Jędrzejczak
4e63e74438 messaging: improve the error messages of closed_errors
The default error message of `closed_error` is "connection is closed".
It lacks the host ID and the IP address of the connected node, which
makes debugging harder. Also, it can be more specific when
`closed_error` is thrown due to the local node shutting down.

Fixes #16923

Closes scylladb/scylladb#27699
2025-12-29 18:36:07 +02:00
Avi Kivity
567c28dd0d Merge 'Decouple sstables::storage::snapshot() and ::clone() functionality' from Pavel Emelyanov
The storage::snapshot() is used in two different modes -- one to save sstable as snapshot somewhere, and another one to create a copy of sstable. The latter use-case is "optimized" by snapshotting an sstable under new generation, but it's only true for local storage. Despite for S3 storage snapshot is not implemented, _cloning_ sstable stored on S3 is not necessarily going to be the same as doing a snapshot.

Another sign of snapshot and clone being different is that calling snapshot() for snapshot itself and for clone use two very different sets of arguments -- snapshotting specifies relative name and omits new generation, while cloning doesn't need "name" and instead provides generation. Recently (#26528) cloning got extra "leave_unsealed" tag, that makes no sense for snapshotting.

Having said that, this PR introduces sstables::storage::clone() method and modifies both, callers and implementations, according to the above features of each. As a result, code logic in both methods become much simpler and a bunch of bool classes and "_tag" helper structures goes away.

Improving internal APIs, no need to backport

Closes scylladb/scylladb#27871

* github.com:scylladb/scylladb:
  sstables, storage: Drop unused bool classes and tags
  sstables/storage: Drop create_links_common() overloads
  sstable: Simplify storage::snapshot()
  sstables: Introduce storage::clone()
2025-12-29 17:50:54 +02:00
Avi Kivity
9927c6a3d4 Merge 'Reapply "audit: enable some subset of auditing by default"' from Piotr Smaron
This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13.

<h3>Why re-enabling table audit</h3>

Audit has been disabled (scylladb/scylla-enterprise/pull/3094) over many concerns raised against the table implementation, e.g. scylladb/scylla-enterprise/issues/2939 / scylladb/scylla-enterprise/issues/2759 + there's whole outstanding backlog of issues . One of the concerns was also a possible loss of availability, and since then we migrated audit keyspace from SimpleStrategy RF=1 to NetworkTopologyStrategy RF=3 (scylladb/scylla-enterprise/pull/3399) and stopped failing queries when auditing fails (scylladb/scylla-enterprise/pull/3118 & scylladb/scylla-enterprise/pull/3117), which improves the situation but doesn't address all the concerns. Eventually we want to use syslog as audit's sink, but it's not fully ready just yet, and so we'll restore table audit for now to increase the security, but later switch to syslog. BTW. cloud will enable table audit for AUTH category scylladb/sre-ops-automation/issues/2970 separately from this effort.

<h3>Performance considerations</h3>

We are assuming that the events for the enabled categories, i.e. DCL, DDL, AUTH & ADMIN, should appear at about the same, low cadence, with AUTH perhaps having the biggest impact of them all under some workloads. The performance penalty of enabling just the AUTH category [has been measured](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) and while authentication throughput and read/write throughput remain stable, the queries' P99 latency may decrease by a couple of % in the most hardcore scenarios.

Fixes: https://github.com/scylladb/scylladb/issues/26020

Gradually re-enabling audit feature, no need to backport.

Closes scylladb/scylladb#27262

* github.com:scylladb/scylladb:
  doc: audit: set audit as enabled by default
  Reapply "audit: enable some subset of auditing by default"
2025-12-29 16:41:04 +02:00
Tomasz Grabiec
bbf9ce18ef Merge 'load_balancer: compute node load based on tablet sizes' from Ferenc Szili
Currently, the tablet load balancer performs capacity based balancing by collecting the gross disk capacity of the nodes, and computes balance assuming that all tablet sizes are the same.

This change introduces size-based load balancing. The load balancer does not assume identical tablet sizes any more, and computes load based on actual tablet sizes.

The size-based load balancer computes the difference between the most and least loaded nodes in the balancing set (nodes in DC, or nodes in a rack in case of `rf-rack-valid-keyspaces`) and stops further balancing if this difference is bellow the config option `size_based_balance_threshold_percentage`.

This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node:

`delta = (most_loaded - least_loaded) / most_loaded`

If this delta is smaller then the config threshold, the balancer will consider the nodes balanced.

This PR is a part of a series of PRs which are based on top of each other.

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
- The fifth part changes the load balancing simulator: #26438

This is a new feature, backport is not needed.

Fixes #26254

Closes scylladb/scylladb#26254

* github.com:scylladb/scylladb:
  test, load balancing: add test for table balance
  load_balancer: add cluster feature for size based balancing
  load_balancer: implement size-based load balancing
  config: add size based load balancing config params
  load_stats: use trinfo to decide how to reconcile tablet size
  load_sketch: use tablet sizes in load computation
  load_stats: add get_tablet_size_in_transition()
2025-12-29 15:01:38 +01:00
Pavel Emelyanov
d892140655 Merge 'Reduce allocations when traversing compaction_groups' from Benny Halevy
- table, storage_group: add compaction_group_count
  - And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
  - compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.

* Improvement, no backport needed

Closes scylladb/scylladb#27479

* github.com:scylladb/scylladb:
  table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
  replica: storage_group: rename compaction_groups to compaction_groups_immediate
2025-12-29 16:26:33 +03:00
Gleb Natapov
4a5292e815 raft topology: Notify that a node was removed only once
Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending
notification only if the node still exists in the peers table. It will
be removed by the first notification and following notification will not
be sent.

Closes scylladb/scylladb#27743
2025-12-29 14:22:34 +01:00
Piotr Smaron
fb4d89f789 treewide: fix some spelling errors 2025-12-29 13:53:56 +01:00
Piotr Smaron
ba5c70d5ab codespell: ignore iif and tread
There are correct:
- iif is a boost's header name
- `tread carefully` is an actual english phrase
2025-12-29 13:53:56 +01:00
Nadav Har'El
8df9cfcde8 Merge 'Add table size bytes to describe table' from Radosław Cybulski
Add table size to DescribeTable's reply in Alternator

Fills DescribeTable's reply with missing field TableSizeBytes.

- add helper class simple_value_with_expiry, which is like std::optional
  but the value put has a timeout.
- add ignore_errors to estimate_total_sstable_volume function - if set
  to true the function will catch errors during RPC and ignore them,
  substituting 0 for missing value.
- add a reference to storage_service to executor class (needed to call
  estimate_total_sstable_volume function).
- add fill_table_description and create_table_on_shard0 as non static
  methods to executor class
- calculate TableSizeBytes value for a given table and return it as
  part of DescribeTable's return value. The value calculated is cached for
  approximately 6 hours (as per DescribeTable's specification).
  The algorithm is as follows:
  - if the requested value is in cache and is still valid it's returned,
    nothing else happens.
  - otherwise:
    - every shard of every node is requested to calculate size of its data
    - if the error happens, the error is ignored and we assume the given
      shard has a size of 0
    - all such values are summed producing total size
    - produced value is returned to a caller
    - on the node the call for a size happened every shard is requested to
      cache produced value with a 6 hour timeout.
    - if the next call comes for a differet shard on the same node that
      doesn't yet have cached value, the shard will request the value to
      be calculated again. The new value will overwrite the old one on
      every shard on this node.
    - if the next call comes to a different node, the process of
      calculation will happen from start, possibly producing different
      value. The value will have it's own timeout, there's no attempt made
      to synchronize value between nodes.
- add a alternator_describe_table_info_timeout_in_seconds parameter, which
  will control, how long DescribeTable's table information are being held
  in cache. Default is 6 hours.
- update test to use parameter
  `alternator_describe_table_info_timeout_in_seconds` - setting it to 0
  and forcing flushing memtables to disk allows checking, that table size
  has grown.

Fixes #7551

Closes scylladb/scylladb#24634

* github.com:scylladb/scylladb:
  alternator: fix invalid rebase
  Update tests
  Update documentation
  Add table size to DescribeTable's output
  Promote fill_table_description and create_table_on_shard0 to methods
  Modify estimate_total_sstable_volume to opt ignore errors
  Add alternator_describe_table_info_cache_validity_in_seconds config option
  Add ref to service::storage_service to executor
  Add simple_value_with_expiry util class
2025-12-29 14:47:36 +02:00
Benny Halevy
f60033db63 db: system_keyspace: get_group0_history: unfreeze_gently
Prevent stall when the group0 history is too long using unfreeze_gently
rather than the synchronous unfreeze() function

Fixes #27872

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27873
2025-12-29 12:00:54 +02:00
copilot-swe-agent[bot]
d25d295e84 alternator/server: update SSL comment 2025-12-29 09:34:08 +01:00
Radosław Cybulski
df20f178aa alternator: fix invalid rebase
Fix an invalid rebase, that would properly merge code coming
from master, except that code would ignore refactor done in the patch.
2025-12-29 08:33:10 +01:00
Radosław Cybulski
a31c8762ca Update tests 2025-12-29 08:33:09 +01:00
Radosław Cybulski
5e1254eef0 Update documentation 2025-12-29 08:33:08 +01:00
Radosław Cybulski
a86b782d3f Add table size to DescribeTable's output
Add a table size to DescribeTable's output.
2025-12-29 08:33:07 +01:00
Radosław Cybulski
1bd855a650 Promote fill_table_description and create_table_on_shard0 to methods
Promote `executor::fill_table_description` and
`executor::create_table_on_shard0` to methods (from static functions).
2025-12-29 08:33:06 +01:00
Radosław Cybulski
6a26381f4f Modify estimate_total_sstable_volume to opt ignore errors
Modify `storage_service::estimate_total_sstable_volume` function to
optionally ignore errors (instead substitute 0), when `ignore_errors`
parameter is set to `yes`.
2025-12-29 08:33:06 +01:00
Radosław Cybulski
a532fc73bc Add alternator_describe_table_info_cache_validity_in_seconds config option
Add a `alternator_describe_table_info_cache_validity_in_seconds`
configuration option with default value of 6 hours.
2025-12-29 08:33:05 +01:00
Radosław Cybulski
e246abec4d Add ref to service::storage_service to executor
Add a reference to `service::storage_service` to executor object.
2025-12-29 08:33:03 +01:00
Radosław Cybulski
dfa600fb8f Add simple_value_with_expiry util class
Add a `simple_value_with_expiry` utility class, which functions like
a `std::optional` with added timeout. When emplacing a value, user
needs to provide timeout, after which value expires (in which case
the `simple_value_with_expiry` object behaves as if was never set
at all).
Add boost tests for the new class.
2025-12-29 08:32:52 +01:00
Pavel Emelyanov
2e33234e91 util: Remove lister::rmdir()
There's seastar helper that does the same, no need to carry yet another
implementation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27851
2025-12-28 19:46:19 +02:00
Avi Kivity
63e3a22f2e Merge 'group0_state_machine: don't update in-memory state machine until start' from Piotr Dulikowski
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
https://github.com/scylladb/scylladb/issues/26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf187a, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

The fix is accompanied by a reproducer of scylladb/scylladb#26945.

Fixes: scylladb/scylladb#26945

This is not a regression, so no need to backport.

Closes scylladb/scylladb#27528

* github.com:scylladb/scylladb:
  test: cluster: test for recovery after partial group0 command
  group0_state_machine: remove obsolete comment about group0 consistency
  group0_state_machine: don't update in-memory state machine until start
  group0_state_machine: move reloading out of std::visit
  service: raft: add state machine ref to raft_server_for_group
2025-12-28 13:59:26 +02:00
Pavel Emelyanov
e963a8d603 checked-file: Implement experimental_list_directory()
The method in question returns coroutine generator that co_yields
directory_entry-s. In case the method is not implemented, seastar
creates a fallback generator, that calls existing subscription-based
list_directory() and co_yields them. And since checked file doesn't yet
have it, fallback generator is used, thus skipping the lower file
yielding lister. Not nice.

This patch implements the generator lister for checked file, thus making
full use of lower file generator lister too.

A side note. It's not enough to implement it like

    return do_io_check([] {
        return lower_file->experimental_list_directory();
    });

like list_directory() does, since io-checking will _not_ happen on
directory reading itself, as it's supposed to.

This is the problem of the check_file::list_directory() implementation
-- it only checks for exception when creating the subscription (and it
really never happens), but reading the directory itself happens without
io checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27850
2025-12-28 13:37:44 +02:00
Yaron Kaikov
1ee89c9682 Revert "scripts: benign fixes flagged by CodeQL/PyLens"
This reverts commit 377c3ac072.

This breaks all artifact tests and cloud image build process

Closes scylladb/scylladb#27881
2025-12-28 09:49:49 +02:00
Ferenc Szili
6d3c720a08 test, load balancing: add test for table balance
This change adds a boost test which validates the resulting table
balance of size based load balancing. The threshold was set to a
conservative 1.5 overcommit to avoid flakyness.
2025-12-27 11:39:08 +01:00
Ferenc Szili
b7ebd73e53 load_balancer: add cluster feature for size based balancing
This patch adds a cluster feature size_based_load_balancing which, until
enabled, will force capacity based balancing. This is needed because
during rolling upgrades some of the nodes will have incomplete data in
load_stats (missing tablet sizes and effective_capacity) which are
needed for size based balancing to make good decisions and issue correct
migrations.
2025-12-27 11:39:08 +01:00
Ferenc Szili
10eb364821 load_balancer: implement size-based load balancing
This changes introduces tablet size based load balancing. It is an
extension of capacity based balancing with the addition of actual tablet
sizes.

It computes the difference between the most and least loaded nodes in
the DC and stops further balancing if this difference is bellow the
config option size_based_balance_threshold_percentage.

This config option does not apply to the absolute load, but instead to
the percentage of how much the most loaded node is more loaded than the
least loaded node:

delta = (most_loaded - least_loaded) / most_loaded

If this delta is smaller then the config threshold, the balancer will
consider the nodes balanced.
2025-12-27 11:20:20 +01:00
Ferenc Szili
cc9e125f12 config: add size based load balancing config params
This change adds:

- The config paremeter force_capacity_based_balancing which, when
  enabled performs capacity based balancing instead of size based.
- The config parameter size_based_balance_threshold_percentage which
  sets the balance threshold for the size based load balancer.
- The config parameter minimal_tablet_size_for_balancing which sets the
  minimal tablet size for the load balancer.
2025-12-27 10:37:38 +01:00
Ferenc Szili
0c9b93905e load_stats: use trinfo to decide how to reconcile tablet size
This patch corrects the way update_load_stats_on_end_migration() decides
which tablet transition occured, in order to reconcile tablet sizes in
load_stats. Before, the transition kind was inferred from the value of
leaving and pending replicas. This patch changes this to use the value
of trinfo.transition.

In case of a rebuild, and in case there is only one replica, the new
tablet size will be set to 0.
2025-12-27 10:37:38 +01:00
Ferenc Szili
621cb19045 load_sketch: use tablet sizes in load computation
This commit changes load_sketch so that it computes node and shard load
based on tablet sizes instead of tablet count.
2025-12-27 10:37:23 +01:00
Ferenc Szili
1c9ec9a76d load_stats: add get_tablet_size_in_transition()
This patch adds a method to load_stats which searches for the tablet
size during tablet transition. In case of tablet migration, the tablet
will be searched on the leaving replica, and during rebuild we will
return the average tablet size of the pending replicas.
2025-12-27 10:37:23 +01:00
Pavel Emelyanov
bda1709734 Merge 'test: fix infinite loop in python log browsing code triggered from test_orphaned_sstables_on_startup' from Avi Kivity
Recently, test/cluster/test_tablet.py::test_orphaned_sstables_on_startup started
spinning in the log browsing code, part of a the test library that looks into log files
for expected or unexpected patterns. This reproduced somewhat in continuous
integration, and very reliably for me locally.

The test was introduced in fa10b0b390, a year ago.

There are two bugs involved: first, that we're looking for crashes in this test,
since in fact it is expected to crash. The node expectedly fails with an
on_internal_error. Second, the log browsing code contains an infinite loop
if the crash backtrace happens to be the last thing in the log. The series
fixes both bugs.

Fixes #27860.

While the bad code exists in release branches, it doesn't trigger there so far, so best
to only backport it if it starts manifesting there.

Closes scylladb/scylladb#27879

* github.com:scylladb/scylladb:
  test: pylib: log_browsing: fix infinite loop in find_backtraces()
  test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
2025-12-26 10:45:56 +03:00
Pavel Emelyanov
712cc8b8f1 sstables, storage: Drop unused bool classes and tags
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
9e189da23a sstables/storage: Drop create_links_common() overloads
There's a bunch of tagged create_links_common() overloads that call the
most generic one with properly crafted arguments and the link_mode.
Callers of those one-liners can craft the args themselves.

As a result, there's only one create_links_common() overload and callers
explicitly specifying what they want from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
32cf358f44 sstable: Simplify storage::snapshot()
Now there are only two callers left -- sstable::snapshot() and
sstable::seal() that wants to auto-backup the sealed sstable.

The snapshot arguments are:
- relative path, use _base_dir
- no new generation provided
- no leave-unsealed tag

With that, the implementation of filesystem_storage::snapshot() is as
simple as
- prepare full path relative to _base_dir
- touch new directory
- call create_links_common()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
8e496a2f2f sstables: Introduce storage::clone()
And call it from sstable::clone() instead of storage::snapshot().

The snapshot arguements are:
- target directory is storage::prefix(), that's _dir itself
- new generation is always provided, no need for optional
- leave_unsealed bool flag

With that, the implementation of filesystem_storage::clone() is as
simple as call create_links_common() forwarding args and _dir to it. The
unification of leave_unsealed branches will come a bit later making this
code even shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Nadav Har'El
9c50d29a00 test/boost: fix flaky test_inject_future_disabled
The test boost/error_injection_test.cc::test_inject_future_disabled
checks what happens when a sleep injection is *disabled*: The test
has a 10-millisecond-sleep injection and measures how much it takes.
The test expects it to take less than 10 milliseconds - in fact it
should take almost zero. But this is not guaranteed - on a slow debug
build and an overcommitted server this do-nothing injection can take
some time, and in one run (#27798) it took 14 milliseconds - and the
test failed.

The solution is easy - make the sleep-that-doesn't-happen much longer -
e.g., 10 whole seconds. Since this sleep still doesn't happen, we
expect the injection to return in less - much less - than 10 seconds.
This 10 seconds is so ridiculously high we don't expect the do-nothing
injection to take 10 seconds, not even a ridiculously busy test machine.

Fixes #27798

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27874
2025-12-25 20:46:31 +02:00
Avi Kivity
92996ce9fa test: pylib: log_browsing: fix infinite loop in find_backtraces()
The find_backtraces() function uses a very convoluted loop to
read the log file. The loop fails to terminate if the last thing
in the log file is the backtrace, since the loop termination condition
(`not line`) continues to be true.

It's not clear why this did not reliably hit before, but it now
reliably reproduces for me on both x86 and aarch64. Perhaps timing
changed, or perhaps previously we had more text on the log.
2025-12-25 20:22:17 +02:00
Avi Kivity
50a3460441 test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.

The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.

Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
2025-12-25 20:22:17 +02:00
Avi Kivity
55c7bc746e Revert "vector_search_validator: move high availability tests from vector-store.git"
This reverts commit caa0cbe328. It is
either extremely slow or broken. I was never able to get it to
run on an r8gd.8xlarge (on the NVMe disk). Even when it passes,
it is very slow.

Test script:

```

git submodule update --recursive || exit 125

rm -rf build

d() { ./tools/toolchain/dbuild -it -- "$@"; }

d ./configure.py --mode release || exit 125
d ninja release-build || exit 125
d ./test.py --mode release
```

Ref #27858
Ref #27859
Ref #27860
2025-12-25 12:30:22 +00:00
Botond Dénes
ebb101f8ae scylla-gdb.py: scylla small-objects: make freelist traversal more robust
Traversing the span's freelist is known to generate "Cannot access
memory at address ..." errors, which is especially annoying when it
results in failed CI. Make this loop more robust: catch gdb.error coming
from it and just log a warning that some listed objects in the span may
be free ones.

Fixes: #27681

Closes scylladb/scylladb#27805
2025-12-25 13:26:09 +03:00
Alex
f769e52877 test: boost: Fix flaky test_large_file_upload_s3 by creating induvidual files for testing During CI runs, multiple instances of the same test may execute concurrently. Although the test uses a temporary directory, the downloaded bucket artifacts were written using an identical filename across all instances.
This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions.

Change Description

This change ensures that each test instance uses a unique filename for downloaded bucket files.
By isolating file writes per test execution, concurrent runs no longer interfere with each other.

Fixes: #27824

backport not required

Closes scylladb/scylladb#27843
2025-12-25 09:40:13 +02:00
Benny Halevy
51433b838a table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
compaction_groups_immediate() may allocate memory, but when called from a
synchronous call site, the caller can use for_each_compaction_group
instead to iterate over the compaction groups with no extra allocations.

Calling compaction_groups_immediate() is still required from an async
context when we want to "sample" the compaction groups
so we can safely iterate over them and yield in the inner loop.

Also, some performance insensitive call sites using
compaction_groups_immediate had been left as they are
to keep them simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-24 21:19:28 +02:00
Benny Halevy
0e27ee67d2 replica: storage_group: rename compaction_groups to compaction_groups_immediate
To better reflect that it returns a materialized vector
of compaction_group ptrs.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-24 21:19:26 +02:00
Nadav Har'El
186c91233b Merge 'scylla-gdb.py: improve scylla fiber and scylla read-stats' from Botond Dénes
Improve scylla fiber's ability to traverse through coroutines.
Add --direction command-line parameter to scylla-fiber.
Fix out-of-date premit collection in scylla read-stat and improve the printout.

scylla-gdb.py improvements, no backport needed

Closes scylladb/scylladb#27766

* github.com:scylladb/scylladb:
  scylla-gdb.py: scylla read-stats: include all permit lists
  scylla-gdb.py: scylla fiber: add --direction command-line param
  scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
2025-12-24 17:49:58 +02:00
Botond Dénes
27bf65e77a db/batchlog_manager: add missing <seastar/coroutine/parallel_for_each.hh> include
Build only fails if `--disable-precompiled-header` is passed to
`configure.py`. Not sure why.

Closes scylladb/scylladb#27721
2025-12-24 16:32:12 +02:00
Botond Dénes
c66275e05c cql3/statements/batch_statement: make size error message more verbose
Mention the type of batch: Logged or Unlogged. The size (warn/fail on
too large size) error has different significance depending on the type.

Refs: #27605

Closes scylladb/scylladb#27664
2025-12-24 15:27:01 +02:00
Piotr Szymaniak
9c5b4e74c3 doc: Correct reference in dev/audit.md
Closes scylladb/scylladb#27832
2025-12-24 15:25:15 +02:00
Botond Dénes
ccc03d0026 test/pylib/runner.py: pytest_configure(): coerce repeat to int
Coerce the return value of config.getoption("--repeat") to int to avoid:

    Traceback (most recent call last):
      File "/usr/bin/pytest", line 8, in <module>
        sys.exit(console_main())
                 ~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main
        code = main()
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main
        ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
        return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
               ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
        return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main
        config._do_configure()
        ~~~~~~~~~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure
        self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic
        res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False)
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
          return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure
        config.run_ids = tuple(range(1, config.getoption("--repeat") + 1))
                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
    TypeError: can only concatenate str (not "int") to str

Closes scylladb/scylladb#27649
2025-12-24 15:13:02 +02:00
Nadav Har'El
8df5189f9c Merge 'docs: scylla-sstable.rst: extract script API to separate document' from Botond Dénes
The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate.

Documentation refactoring, no backport needed.

Closes scylladb/scylladb#27609

* github.com:scylladb/scylladb:
  docs: scylla-sstable-script-api.rst: add introduction and title
  docs: scylla-sstable.rst: extract script API to separate document
  docs: scylla-sstable: prepare for script API extract
2025-12-24 15:02:57 +02:00
Botond Dénes
b036a461b7 tools/scylla-sstable: dump-schema: incude UDT description in dump
If the table uses UDTs, include the description of these (CREATE TYPE
statement) in the schema dump. Without these the schema is not useful.

Closes scylladb/scylladb#27559
2025-12-24 14:46:52 +02:00
Botond Dénes
3071ccd54a Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.

Closes scylladb/scylladb#27762

* github.com:scylladb/scylladb:
  table: Move snapshot_file_set to table.cc
  table: Rename and move snapshot_on_all_shards() method
  table: Ditch jsondir variable
  table, sstables: Pass snapshot name to sstable::snapshot()
  table: Use snapshot_writer in write_manifest()
  table: Use snapshot_writer in write_schema_as_cql()
  table: Add snapshot_writer::sync()
  table: Add snapshot_writer::init()
  table: Introduce snapshot_writer
  table: Move final sync and rename seal_snapshot()
  table: Hide write_schema_as_cql()
  table: Hide table::seal_snapshot()
  table: Open-code finalize_snapshot()
  table: Fix indentation after previuous patch
  table: Use smp::invoke_on_all() to populate the vector with filenames
  table: Don't touch dir once more on seal_snapshot()
  table: Open-code table::take_snapshot() into caller lambda
  table: Move parts of table::take_snapshot to sstables_manager
  table: Introduce table::take_snapshot()
  table: Store the result of smp::submit_to in local variable
2025-12-24 13:46:47 +02:00
Nadav Har'El
4ae45eb367 test/alternator: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27676
2025-12-24 13:44:28 +02:00
Nadav Har'El
da00401b7d test/alternator: rename test with duplicate name
The file test/alternator/test_transact.py accidentally had two tests
with the same name, test_transact_get_items_projection_expression.
This means the first of the two tests was ignored and never run.

This patch renames the second of the two to a more appropriate
(and unique...) name.

I verified that after this change the number of tests in this file
grows by one, and that still all tests pass on DynamoDB and fail
(as expected by xfail) on Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27702
2025-12-24 13:43:43 +02:00
Botond Dénes
95d4c73eb1 Merge 'Make object storage config truly updateable' from Pavel Emelyanov
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.

Refs: #7316
Fixes: #26509

Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup

Closes scylladb/scylladb#27689

* github.com:scylladb/scylladb:
  sstables/storage_manager: Fix configured endpoints observer
  test/object_store: Add test to validate how endpoint config update works
2025-12-24 13:42:44 +02:00
Botond Dénes
12dcf79c60 Merge 'build: support (and prefer) sccache as the compiler cache' from Avi Kivity
Currently, we support ccache as the compiler cache. Since it is transparent, nothing
much is needed to support it.

This series adds support to sccache[1] and prefers it over ccache when it is installed.

sccache brings the following benefits over ccache:
1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler
2. Rust support
3. C++20 modules (upcoming[2])

It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce
build times, but without a compiler cache and distributed build support, they come with too large
a penalty. This removes the penalty.

The series detects that sccache is installed, selects it if so (and if not overridden
by a new option), enables it for C++ and Rust, and disables ccache transparent
caching if sccache is selected.

Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That
is left for later.

[1] https://github.com/mozilla/sccache
[2] https://github.com/mozilla/sccache/pull/2516

Toolchain improvement, won't be backported.

Closes scylladb/scylladb#27834

* github.com:scylladb/scylladb:
  build: apply sccache to rust builds too
  build: prevent double caching by compiler cache
  build: allow selecting compiler cache, including sccache
2025-12-24 13:40:02 +02:00
Nadav Har'El
74a57d2872 test/cqlpy: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27675
2025-12-24 13:31:41 +02:00
Andrzej Jackowski
632ff66897 doc: audit: mention double audit sink in Enabling Audit section
Configuration of both table and syslog audit is possible since
scylladb/scylladb#26613 was implemented. However, the "Enabling Audit"
section of the documentation wasn't updated, which can be misleading.

Ref: scylladb/scylladb#26613

Closes scylladb/scylladb#27790
2025-12-24 13:20:03 +02:00
Gleb Natapov
04976875cc topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.

Closes scylladb/scylladb#27757
2025-12-24 13:17:53 +02:00
Yaniv Michael Kaul
377c3ac072 scripts: benign fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
No functional changes, just to get rid of some standard CodeQL warnings.

Benign - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27801
2025-12-24 13:08:24 +02:00
Avi Kivity
d6edad4117 test: pylib: resource_gather: don't take ownership of /sys/fs/cgroup under podman
Under podman, we already own /sys/fs/cgroup. Run the chown command only
under docker where the container does not map the host user to the
container root user.

The chown process is sometimes observed to fail with EPERM (see issue).
But it's not needed, so avoid it.

Fixes #27837.

Closes scylladb/scylladb#27842
2025-12-24 10:56:24 +02:00
Marcin Maliszkiewicz
3c1e1f867d raft: auth: add semaphore to auth_cache::load_all
Auth cache loading at startup is racing between
auth service and raft code and it doesn't support
concurrency causing it to crash.

We can't easily remove any of the places as during
raft recovery snapshot is not loaded and it relies
on loading cache via auth service. Therefore we add
semaphore.

Fixes https://github.com/scylladb/scylladb/issues/27540

Closes scylladb/scylladb#27573
2025-12-24 10:56:24 +02:00
Nadav Har'El
f3a4af199f test/cqlpy/test_materialized_view.py: Fix for Commented-out code
This patch was suggested and prepared by copilot, I am writing the commit
message because the original one was worthless.

In commit cf138da, for an an unexplained reason, a loop waiting until the
expected value appears in a materialized view was replaced by a call for
wait_for_view_built(). The old loop code was left behind in a comment,
and this commented-out code is now bothering our AI. So let's delete the
commented-out code.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27646
2025-12-24 10:56:23 +02:00
Botond Dénes
1bb897c7ca Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised.

Closes scylladb/scylladb#27360

* github.com:scylladb/scylladb:
  object_storage: Temporarily handle pure endpoint addresses as endpoints
  code: Remove dangling mentions of s3::endpoint_config
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  test: Add named constants for test_get_object_store_endpoints endpoint names
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
2025-12-24 06:59:02 +02:00
Botond Dénes
954f2cbd2f Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity
For deployments fronted by a reverse proxy (haproxy or privatelink), we want to
use proxy protocol v2 so that client information in system.clients is correct and so
that the shard-aware selection protocol, which depends on the source port, works
correctly. Add proxy-protocol enabled variants of each of the existing native transport
listeners.

Tests are added to verify this works. I also manually tested with haproxy.

New feature, no backport.

Closes scylladb/scylladb#27522

* github.com:scylladb/scylladb:
  test: add proxy protocol tests
  config, transport: support proxy protocol v2 enhanced connections
2025-12-24 06:58:00 +02:00
Nadav Har'El
e75c75f8cd test/cqlpy: fix two tests that couldn't fail because of typo
As noticed by copilot, two tests in test_guardrail_compact_storage.py
could never fail, because they used `pytest.fail` instead of the
correct `pytest.fail()` to fail. Unfortunately, Python has a footgun
where if it sees a bare function name without parenthesis, instead of
complaining it evaluates the function object and then ignores it,
and absolutely nothing happens.

So let's add the missing `()`. The test still passes, but now it at
least has a chance of failing if we have a regression.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27658
2025-12-24 06:49:54 +02:00
Yaron Kaikov
d671ca9f53 fix: remove return from finally block in s3_proxy.py
during any jenkins job that trigger `test.py` we get:
```
/jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block
```

The 'return' statement in the finally block was causing a SyntaxWarning.
Moving the return outside the finally block ensures proper exception
handling while maintaining the intended behavior.

Closes scylladb/scylladb#27823
2025-12-24 06:48:03 +02:00
Avi Kivity
fc81983d42 test: sstable_validation_test: actually test ms version
sstable_validation_test tests the `scylla sstable validate` command
by passing it intentionally corrupted sstables. It uses an sstable
cache to avoid re-creating the same sstables. However, the cache
does not consider the sstable version, so if called twice with the
same inputs for different versions, it will return an sstable with
the original version for both calls. As a results, `ms` sstables
were not tested. Fix this bug by adding the sstable version (and
the schema for good measure) to the cache key.

An additional bug, hidden by the first, was that we corrupted the
sstable by overwriting its Index.db component. But `ms` sstables
don't have an Index.db component, they have a Partitions.db component.
Adjust the corrupting code to take that into account.

With these two fixes, test_scylla_sstable_validate_mismatching_partition_large
fails on `ms` sstables. Disable it for that version. Since it was
previously practically untested, we're not losing any coverage.

Fixing this test unblocks further work on making pytest take charge
of running the tests. pytest exposed this problem, likely by running
it on different runners (and thus reducing the effectiveness of the
cache).

Fixes #27822.

Closes scylladb/scylladb#27825
2025-12-24 06:47:31 +02:00
Botond Dénes
cf70250a5c Update seastar submodule
* seastar 7ec14e83...f0298e40 (8):
  > Merge 'coroutine/try_future: call set_current_task() when resuming the coroutine' from Botond Dénes
    coroutine/try_future: call set_current_task() when resuming the coroutine
    core: move set_current_task() out-of-line
  > stop_signal: stop including reactor.hh
  > cmake: Mark hwloc headers as system includes to suppress warnings
  > build: explicitly enable vptr sanitizer
  > httpd: Add API to set tcp keepalive params
  > Merge 'Make datagram_channel::send() use temporary_buffer-s' from Pavel Emelyanov
    net: Remove no longer used to_iovec() helpers
    net,code: Update callers to use new datagram_channel::send()
    net: Introduce datagram_channel::send(span<temporary_buffer>) method
    posix-stack: Make UDP socket implementation use wrapped_iovec
    posix-stack: Introduce wrapped_iovec
  > code: Move pollable_fd_state::write_all(const char*) from API level 9
  > thread: Remove unused sched_group() helper

configure.py: added -lubsan to DEBUG sanitizer flags

Closes scylladb/scylladb#27511
2025-12-24 06:46:36 +02:00
Nadav Har'El
54f3e69fdc Fix for Statement has no effect
This problem and its fix was suggested by copilot, I'm just writing the
cover letter.
test/nodetool/test_status.py has the silly statement tokens == "?" which
has no effect. Looking around the code suggested to me (and also to
Copilot, nice) that the correct intent was assert tokens == "?" and not,
say, tokens = "?".

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27659
2025-12-24 06:43:26 +02:00
Piotr Dulikowski
9ed820cbf5 test: cluster: test for recovery after partial group0 command
Add a reproducer for scylladb/scylladb#26945. By using error injections,
the test triggers a situation where a command that removes an obsolete
CDC generation is partially applied, then the node is killed an brought
back. Thanks to the fix, restarting the node succeeds and does not
trigger any consistency checks in the group0 reload logic.
2025-12-23 20:50:43 +01:00
Piotr Dulikowski
71bc1886ee group0_state_machine: remove obsolete comment about group0 consistency
The comment is outdated. It is concerned about group0 consistency after
crash, and that re-applying committed commands may require a raft
quorum. First, 579dcf1 was introduced (long ago) which gets rid of the
need for quorum as the node persists the commit index before applying
the commands - so it knows up to which command it should re-apply on
restart. Second, the preceding commits in this PR makes use of this
mechanism for group0.

Remove the comment as the concern was fully addressed. Additionally,
remove a mention of the comment in raft_group0_client.cc - although it
claims that the comment is placed in `group0_state_machine::apply`, it
has been moved to `merge_and_apply` in 96c6e0d (both comments were
originally introduced in 6a00e79).
2025-12-23 20:44:17 +01:00
Piotr Dulikowski
b24001b5e7 group0_state_machine: don't update in-memory state machine until start
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
scylladb/scylladb#26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf1, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

Fixes: scylladb/scylladb#26945
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
f4efdf18a5 group0_state_machine: move reloading out of std::visit
In the next commit, we will adjust the logic so that it only reloads in
memory state only when a flag is set. By moving the reload logic to one
place in `merge_and_apply`, the next commit will be able to reach its
goal by only adding a single `if`.
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
6bdbd91cf7 service: raft: add state machine ref to raft_server_for_group
This reference will be used by the code that starts group0. It will
manually enable the in-memory state machine only after the group0 server
is fully started, which entails replaying the group0 commands that are,
locally, seen as committed - in order to repair any inconsistencies that
might have arisen due to some commands being applied only partially
(e.g. due to a crash).
2025-12-23 20:44:16 +01:00
Michał Hudobski
ce3320a3ff auth: add system table permissions to VECTOR_SEARCH_INDEXING
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.

We also add a test that validates this behavior.

Fixes: SCYLLADB-73

Closes scylladb/scylladb#27546
2025-12-23 15:53:07 +02:00
Pawel Pery
caa0cbe328 vector_search_validator: move high availability tests from vector-store.git
Initially, tests for high availability were implemented in vector-store.git
repository. High availability is currently implemented in scylladb.git
repository so this repository should be the better place to store them. This
commit copies these tests into the scylladb.git.

The commit copies validator-vector-store/src/high_availability.rs (tests logic)
and validator-tests/src/common.rs (utils for tests) into the local crate
validator-scylla. The common.rs should be copied to be able for reviewer to see
common test code and this code most likely be frequent to change - it will be
hard to maintain one common version between two repositories.

The commit updates also README for vector_search_validator; it shortly describe
the validator modules.

The commit updates reference to the latest vector-store.git master.

As a next step on the vector-store.git high_availability.rs would be removed
and common.rs moved from validator-tests into validator-vector-store.

References: VECTOR-394

Closes scylladb/scylladb#27499
2025-12-23 15:53:07 +02:00
Yaron Kaikov
bad2fe72b6 .github/workflows: Add email validator workflow
This workflow validates that all commits in a pull request use email
addresses ending in @scylladb.com. For each commit with an author or
committer email that doesn't match this pattern, the workflow automatically
adds a comment to the pull request with a warning.

This serves two purposes:

1. Alert maintainers when external contributors submit code (which is
   acceptable, but good to be aware of)

2. Help ScyllaDB developers catch cases where they haven't configured
   their git email correctly

When a non-@scylladb.com email is detected, the workflow posts this
comment on the pull request:
```
    ⚠️ Non-@scylladb.com Email Addresses Detected

    Found commit(s) with author or committer emails that don't end with
    @scylladb.com.

    This indicates either:

    - An external contributor (acceptable, but maintainer should be aware)
    - A developer who hasn't configured their git email correctly

    For ScyllaDB developers:

    If you're a ScyllaDB employee, please configure your git email globally:

        git config --global user.email "your.name@scylladb.com"

    If only your most recent commit is invalid, you can amend it:

        git commit --amend --reset-author --no-edit
        git push --force

    If you have multiple invalid commits, you need to rewrite them all:

        git rebase -i <base-commit>
        # Mark each invalid commit as 'edit', then for each:
        git commit --amend --reset-author --no-edit
        git rebase --continue
        # Repeat for each invalid commit
        git push --force
```
Fixes: https://scylladb.atlassian.net/browse/RELENG-35

Closes scylladb/scylladb#27796
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ec15a1b602 table: Do not move foreign string when writing snapshot
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.

The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.

The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27755
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ecef158345 api: Use ranges library to process views in get_built_indexes()
No functional changes, just make the loop shorter and more
self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27742
2025-12-23 15:53:06 +02:00
Israel Fruchter
53abf93bd8 Update tools/cqlsh submodule
* tools/cqlsh 22401228...9e5a91d7 (7):
  > Add pip retry configuration to handle network timeouts
  > Clean up unwanted build artifacts and update .gitignore
  > test_legacy_auth: update to pytest format
  > Add support for disabling compression via CLI and cqlshrc
  > Update scylla-driver to 3.29.7 (#144)
  > Update scylla-driver version to 3.29.6
  > Revert "Migrate workflows to Blacksmith"

Closes scylladb/scylladb#27567

[avi: build optimized clang 21.1.7
      regenerate frozen toolchain with optimized clang from

      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-aarch64.tar.gz
      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#27734
2025-12-23 15:53:06 +02:00
Aleksandra Martyniuk
bbe64e0e2a test: rename duplicate tests
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.

Rename the tests so that they are unique.

Fixes: https://github.com/scylladb/scylladb/issues/27701.

Closes scylladb/scylladb#27720
2025-12-23 15:53:06 +02:00
Anna Stuchlik
7198191aa9 doc: fix the license information on DockerHub
This commit removes the OSS-related information from DockerHub.
It adds the link to the Source Available license.

Fixes https://github.com/scylladb/scylladb/issues/22440

Closes scylladb/scylladb#27706
2025-12-23 15:53:06 +02:00
Calle Wilund
d5f72cd5fc test::pylib::encryption_provider: Push up setting system_key_directory to all providers
Fixes #27694

Unless set by config, the location will default to /etc/scylla, which is not a good
place to write things for tests. Push the config properly and the directory (but
_not_ creation) to all provider basetype.

Closes scylladb/scylladb#27696
2025-12-23 15:53:06 +02:00
Dawid Mędrek
afde5f668a test: Implement describing Boost tests in JSON format
The Boost.Test framework offers a way to describe tests written in it
by running them with the option `--list_content`. It can be
parametrized by either HRF (Human Readable Format) or DOT (the Graphviz
graph format) [1]. Thanks to that, we can learn the test tree structure
and collect additional information about the tests (e.g. labels [2]).

We currently emply that feature of the framework to collect and run
Boost tests in Scylla. Unfortunately, both formats have their
shortcomings:

* HRF: the format is simple to parse, but it doesn't contain all
       relevant information, e.g. labels.
* DOT: the format is designed for creating graphical visualizations,
       and it's relatively difficult to parse.

To amend those problems, we implement a custom extension of the feature.
It produces output in the JSON format and contains more than the most
basic information about the tests; at the same time, it's easy to browse
and parse.

To obtain that output, the user needs to call a Boost.Test executable
with the option `--list_json_content`. For example:

```
$ ./path/to/test/exec -- --list_json_content
```

Note that the argument should be prepended with a `--` to indicate that
it targets user code, not Boost.Test itself.

---

The structure of the new format looks like this (top-level downwards):

- File name
- Test suite(s) & free test cases
- Test cases wrapped in test suites

Note that it's different from the output the default Boost.Test formats
produce: they organize information within test suites, which can
potentially span multiple files [3]. The JSON format makes test files
the primary object of interest and test suites from different files
are always considered distinct.

Example of the output (after applying some formatting):

```
$ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content
[{"file":"test/boost/canonical_mutation_test.cc", "content": {
  "suites": [],
  "tests": [
    {"name": "test_conversion_back_and_forth", "labels": ""},
    {"name": "test_reading_with_different_schemas", "labels": ""}
  ]
}}]
```

---

The implementation may be seen as a bit ugly, and it's effectively
a hack. It's based on registering a global fixture [4] and linking
that code to every Boost.Test executable.

Unfortunately, there doesn't seem to be any better way. That would
require more extensive changes in the test files (e.g. enforcing
going through the same entry point in all of them).

This implementation is a compromise between simplicity and
effectiveness. The changes are kept minimal, while the developers
writing new tests shouldn't need to remember to do anything special.
Everything should work out of the box (at least as long as there's
no non-trivial linking involved).

Fixes scylladb/scylladb#25415

---

References:
[1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html
[2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html
[3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html
[4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.html

Closes scylladb/scylladb#27527
2025-12-23 15:53:06 +02:00
Asias He
140858fc22 tablet-mon.py: Add repair support
Add repair support in the tablet monitor.

Fixes #24824

Closes scylladb/scylladb#27400
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
132aa753da sstables/storage_manager: Fix configured endpoints observer
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.

However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.

Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.

The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:43:11 +03:00
Pavel Emelyanov
f902eb1632 test/object_store: Add test to validate how endpoint config update works
There's a test for backup with non-existing endpoint/bucket/snapshot. It
checks that API call to backup sstables properly fails in that case.
This patch adds similar test for "unconfigured endpoint", but it adds
the endpoint configuration on-the-fly and expects that backup will
proceed after config update.

Currently the test fails, as config update only affect the config
itself, the storage_manager, that's in charge of maintaining endpoint
clients, is not really updated. Next patch will fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:41:38 +03:00
Pavel Emelyanov
e0cddc8c99 table: Move snapshot_file_set to table.cc
It's not used anywhere else now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
e31b72c61f table: Rename and move snapshot_on_all_shards() method
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
48b1ceefaf table: Ditch jsondir variable
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
a21aa5bdf6 table, sstables: Pass snapshot name to sstable::snapshot()
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
d0812c951e table: Use snapshot_writer in write_manifest()
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:13:47 +03:00
Pavel Emelyanov
fe8923bdc7 table: Use snapshot_writer in write_schema_as_cql()
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:17 +03:00
Pavel Emelyanov
9fee06d3bc table: Add snapshot_writer::sync()
And move calls to sync_directory() into it too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:16 +03:00
Pavel Emelyanov
7a298788c0 table: Add snapshot_writer::init()
And move the call to recursive_touch_directory into it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:05 +03:00
Pavel Emelyanov
db6a5aa20b table: Introduce snapshot_writer
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:00:26 +03:00
Pavel Emelyanov
8e247b06a2 table: Move final sync and rename seal_snapshot()
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
37746ba814 table: Hide write_schema_as_cql()
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
976bcef5d0 table: Hide table::seal_snapshot()
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
8a4daf3ef1 table: Open-code finalize_snapshot()
It makes is easier to modify this code further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
6975234d1c table: Fix indentation after previuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
4a88f15465 table: Use smp::invoke_on_all() to populate the vector with filenames
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
41ed11cdbe table: Don't touch dir once more on seal_snapshot()
The directory had been created way before that, in the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
07708acebd table: Open-code table::take_snapshot() into caller lambda
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
ef3b651e1b table: Move parts of table::take_snapshot to sstables_manager
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.

This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
0c45a7df00 table: Introduce table::take_snapshot()
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
2e2cd2aa39 table: Store the result of smp::submit_to in local variable
Tossing bits around not to make it in the next, larger, patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Botond Dénes
58b5d43538 Merge 'test: multi LWT and counters test during tablets resize and migration' from Yauheni Khatsianevich
This PR extends BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets with counters.

And introduces new LWT with counters test durng tablets resize and migration
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency

Refs: https://github.com/scylladb/qa-tasks/issues/1918
Refs: https://github.com/scylladb/qa-tasks/issues/1988
Refs https://github.com/scylladb/scylladb/issues/18068

Closes scylladb/scylladb#27170

* github.com:scylladb/scylladb:
  test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split,  and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency
  test/lwt: add counter-table support to BaseLWTTester
2025-12-23 07:29:35 +02:00
Botond Dénes
bfdd4f7776 Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho
Split prepare can run concurrently with repair.

Consider this:

1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
        5.1) that sstable is not marked as repaired yet
        5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable

If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.

To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.

This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.

The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.

Fixes #26041.
Fixes #27414.

Should be backported to 2025.4 that contains incremental repair

Closes scylladb/scylladb#26528

* github.com:scylladb/scylladb:
  test: Add reproducer for split vs intra-node migration race
  test: Verify split failure on behalf of repair during critical disk utilization
  test: boost: Add failure_when_adding_new_sstable_test
  test: Add reproducer for split vs incremental repair race condition
  compaction: Fail split of new sstable if manager is disabled
  replica: Don't split in do_add_sstable_and_update_cache()
  streaming: Leave sstables unsealed until attached to the table
  replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
  replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
  replica: Wire add_new_sstable_and_update_cache() into streaming consumer
  replica: Document old add_sstable_and_update_cache() variants
  replica: Introduce add_new_sstables_and_update_cache()
  replica: Introduce add_new_sstable_and_update_cache()
  replica: Account for sstables being added before ACKing split
  replica: Remove repair read lock from maybe_split_new_sstable()
  compaction: Preserve state of input sstable in maybe_split_new_sstable()
  Rename maybe_split_sstable() to maybe_split_new_sstable()
  sstables: Allow storage::snapshot() to leave destination sstable unsealed
  sstables: Add option to leave sstable unsealed in the stream sink
  test: Verify unsealed sstable can be compacted
  sstables: Allow unsealed sstable to be loaded
  sstables: Restore sstable_writer_config::leave_unsealed
2025-12-23 07:28:56 +02:00
Botond Dénes
bf9640457e Merge 'test: add crash detection during tests' from Cezar Moise
After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues.
The test directory is also scanned for coredumps.
If any of the above are found, the test will fail with an error.

The following checks are made:
- Any log line matching `Assertion.*failed` or containing `AddressSanitizer` is marked as a critical error
- Lines matching `Aborting on shard` will only be marked as a critical error if the paterns in `manager.ignore_cores_log_patterns` are not found in that log
- If any critical error is found, the log is also scanned for backtraces
- Any backtraces found are decoded and saved
- If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines
- Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line
- The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns`

refs: https://github.com/scylladb/qa-tasks/issues/1804

---

[Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/):
Following examples are run on a separate branch where changes have been made to enable these failures.

`test_unfinished_writes_during_shutdown`
- Errors are found in logs and are not ignored
```
failed on teardown with "Failed:
Server 2096: found 1 error(s) (log: scylla-2096.log)
  ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator  had probably migrated to another node)
Server 2101: found 4 error(s) (log: scylla-2101.log)
  ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"])
  ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)))
Server 2094: found 1 error(s) (log: scylla-2094.log)
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)"
```

`test_kill_coordinator_during_op`
- aborts caused by injection
- `ignore_cores_log_patterns` is not set
- while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. See earlier errors")`
```
failed on teardown with "Failed:
Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1105-backtraces.txt
Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1106-backtraces.txt
Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1113-backtraces.txt
Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1148-backtraces.txt"
```
Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1)

Closes scylladb/scylladb#26177

* github.com:scylladb/scylladb:
  test: add logging to crash_coordinator_before_stream injection
  test: add crash detection during tests
  test.py: add pid to ServerInfo
2025-12-23 07:27:58 +02:00
Pavel Emelyanov
cd2568ad00 test: Merge and parametrize test_backup_to_non_existent_something tests
There are three tests in cluster/object_store suite that check how
backup fails in case either of its parameters doesn't really exists. All
three greatly duplicate each other, it makes sense to merge them into
one larger parametrized test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27695
2025-12-23 07:02:18 +02:00
Avi Kivity
7586c5ccbd Merge 'system.clients: add client_options map column' from Vladislav Zolotarov
This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data.

**Caching and client metadata refactoring:**

* The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) **per-connection**. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1.
So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now).
* These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard.
* Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once.
* This will would reduce the "variable" (per-connection) memory usage by **at least 50%**. And in case of "ScyllaDB Python Driver" driver version - by 78%!
* And all this for a price of a single `loading_shared_values` object **per-shard** (implements a hash table) and a minor overhead for each value **stored** in it.

* Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

* Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362)

**Virtual table and API enhancements:**

* Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879)

**API and interface changes:**

* Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

**Testing and validation:**

* Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90)

Closes scylladb/scylladb#25746

* github.com:scylladb/scylladb:
  transport/server: declare a new "CLIENT_OPTIONS" option as supported
  service/client_state and alternator/server: use cached values for driver_name and driver_version fields
  system.clients: add a client_options column
  controller: update get_client_data to use foreign_ptr for client_data
2025-12-22 20:02:40 +02:00
Emil Maskovsky
d60b908a8e test/raft: improve reporting in the randomized_nemesis_test digest functions
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.

Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.

This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.

Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030

Closes scylladb/scylladb#27791
2025-12-22 20:02:40 +02:00
Nikos Dragazis
20ff2fcc18 docs: Amend limitations for keyspace RF changes
The doc about DDL statements claims that an `ALTER KEYSPACE` will fail
in the presence of an ongoing global topology operation.

This limitation was specifically referring to RF changes, which Scylla
implements as global topology requests (`keyspace_rf_change`), and it
was true when it was first introduced (1b913dd880) because there was
no global topology request queue at that time, so only one ongoing
global request was allowed in the cluster.

This limitation was lifted with the introduction of the global topology
request queue (6489308ebc), and it was re-introduced again very
recently (2e7ba1f8ce) in a slightly different form; it now applies only
to RF changes (not to any request type) and only those that affect the
same keyspace. None of these two changes were ever reflected in the doc.

Synchronize the doc with the current state.

Fixes #27776.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#27786
2025-12-22 20:02:40 +02:00
Andrei Chekun
6ffdada0ea test.py: modify JUnit report for easier rerun on CI
This will allow to add custom XML attribute to the JUnit report. In this
case there will be path to the function that can be used to run with
pytest command. Parametrized tests will have path to the function
excluding parameter.

Closes scylladb/scylladb#27707
2025-12-22 20:02:40 +02:00
Anna Stuchlik
4c247a5d08 doc: document support for i8g and i8ge instances
Fixes https://github.com/scylladb/scylladb/issues/27703

Closes scylladb/scylladb#27754
2025-12-22 20:02:40 +02:00
copilot-swe-agent[bot]
288d4b49e9 Skip backtrace in lsa-timing logs for preemptible reclaim
Preemptible reclaim is only done from the background reclaimer,
so backtrace is not useful. It's also normal that it takes a long time.
Skip the backtrace when reclaim is preemptible to reduce log noise.

Fixes the issue where background reclaim was printing unnecessary
backtraces in lsa-timing logs when operations took longer than the
stall detection threshold.

Closes: #27692

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-22 20:02:40 +02:00
Pavel Emelyanov
e304d912b4 Merge 'db/view/view_building_worker: follow-ups' from Michał Jadwiszczak
This patch consists of a few smaller follow-ups to the view building worker:
- catch general execption in staging task registrator
- remove unnecessary CV broadcast
- don't pollute function context with conditionally compiled variable
- avoid creating a copy of tasks map
- fix some typos

Refs https://github.com/scylladb/scylladb/issues/25929
Refs https://github.com/scylladb/scylladb/pull/26897

This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts.

Closes scylladb/scylladb#26558

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: fix typos
  db/view/view_building_worker: remove unnnecessary empty lines
  db/view/view_building_worker: fix typo
  db/view/view_building_worker: avoid creating a copy of tasks map
  db/view/view_building_worker: wrap conditionally compiled code in a scope
  db/view/view_building_worker: remove unnecessary CV broadcast
  db/view/view_building_worker: catch general execption in staging task registrator
2025-12-22 20:02:40 +02:00
Botond Dénes
846a6e700b Merge 'get_snapshot_details: process also staging directory' from Benny Halevy
Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory,
and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file
as in the snapshot by looking to the fstat data for the dev id and inode number.

However, we do not look the names in the staging directory so staging sstable
would skew the results as the will falsely contribute to the live size, since they
wouldn't be found in the base directory.

This change processes both the staging directory and base table directory
and keeps the file capacity in a map, indexed by the files inode number, allowing us to easily
detect hard links and be resilient against concurrent move of files from the staging sub-directory
back into the base table directory.

Fixes #27635

* Minor issue, no backport required

Closes scylladb/scylladb#27636

* github.com:scylladb/scylladb:
  table: get_snapshot_details: add FIXME comments
  table: get_snapshot_details: lookup entries also in the staging directory
  table: get_snapshot_details: optimize using the entry number_of_links
  table: get_snapshot_details: continue loop for manifest and schema entries
  table: get_snapshot_details: use directory_lister
2025-12-22 20:02:40 +02:00
Botond Dénes
af5e73def9 Merge 'test/cqlpy: remove unused variables' from Nadav Har'El
These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing.

All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually.

Closes scylladb/scylladb#27667

* github.com:scylladb/scylladb:
  test/cqlpy: remove unused variables
  test/cqlpy: use unique partition in test
2025-12-22 20:02:39 +02:00
Avi Kivity
8e462d06be build: apply sccache to rust builds too
sccache works for rust as well as for C++; use it for rust builds
as well.
2025-12-22 15:36:15 +02:00
Avi Kivity
9ac82d63e9 build: prevent double caching by compiler cache
To avoid both sccache and ccache being used simultaneously,
detect if ccache is in $PATH and use whatever compiler would
have been called instead.
2025-12-22 15:36:14 +02:00
Anna Stuchlik
9793a45288 doc: add a Vector Search page under Features
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.

The purpose of the page is to raise awareness of the feature.

Fixes https://scylladb.atlassian.net/browse/VECTOR-215

Closes scylladb/scylladb#27787
2025-12-22 15:29:45 +02:00
Avi Kivity
afb96b6387 build: allow selecting compiler cache, including sccache
Add a new configuration option for selecting the compiler
cache. Prefer sccache if found, since it supports rust as
well as C++, has better support for distributed compilation,
and is slated to receive module support soon.

cmake is also supported.
2025-12-22 15:27:15 +02:00
Alex
033579ad6f db: api: service: Fix ClientConnectorError in test_client_routes The bug was caused by capturing local variables by reference in lambdas passed to with_retry(), which is a coroutine. When the coroutine suspends, the lambda frame exits and the referenced locals are destroyed, leading to use-after-lifetime issues. This change fixes the problem by ensuring safe ownership across suspension points and also refactors how route_keys and route_entries are passed from the caller. Previously they were passed as const lvalue references, which cannot be moved and therefore ended up being repeatedly copied across function calls and lambda invocations. The new approach avoids unnecessary copies and makes the lifetime semantics explicit and safe.
Fixes: 27792

no backport needed private link is only in master branch

Closes scylladb/scylladb#27795
2025-12-22 14:52:47 +02:00
Yaniv Michael Kaul
c1da552fa4 test/pylib/scylla_cluster.py:get_scylla_2025_1_executable() - retry curl download of 2025.1
For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot.

Benign, I hope - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27746
2025-12-22 14:45:06 +02:00
Piotr Smaron
cb3b96b8f4 raft: correct lease->least typo in a comment
Funny, when researching if our raft implementation relies on the 'lease'
mechanism, I noticed this typo.

Closes scylladb/scylladb#27803
2025-12-22 14:39:55 +02:00
Avi Kivity
b105ad8379 build: drop -fexperimental-assignment-tracking clang option
`-fexperimental-assignment-tracking` was added in fdd8b03d4b to
make coroutine debugging work.

However, since then, it became unnecessary, perhaps due to 87c0adb2fe,
or perhaps to a toolchain fix.

Drop it, so we can benefit from assignment tracking (whatever it is),
and to improve compatibility with sccache, which rejects this option.

I verified that the test added in fdd8b03d4b fails without the option
and passes with this patch; in other words we're not introducing a
regression here.

Closes scylladb/scylladb#27763
2025-12-22 14:33:48 +02:00
Michael Litvak
9f8aea21e3 docs: update RF-rack restrictions
Update the documentation about restrictions to tablets keyspaces related
to RF-rack.

* MV/SI require the keyspace to be RF-rack-valid
* topology operations are restricted if a keyspace has views to preserve
  RF-rack-validity
2025-12-22 09:21:07 +01:00
Michael Litvak
75b5285cdf cql3: don't apply RF-rack restrictions on vector indexes
When creating an index we validate that the keyspace is RF-rack-valid
and print a warning that the keyspace must remain RF-rack-valid.

This should apply only to indexes that are based on materialized views
for which there are consistency concerns when the keyspace is not
RF-rack-valid.

vector indexes are not based on materialized views, hence these
restrictions should not apply to them.
2025-12-22 09:21:07 +01:00
Michael Litvak
06343b58a2 cql3: add warning when creating mv/index with tablets about rf-rack
Creating a MV or index in a tablets-based keyspace now forces additional
restrictions on the keyspace. The keyspace must be RF-rack-valid and it
must remain RF-rack-valid while the view exists.

Add a CQL warning about these restrictions.
2025-12-22 09:21:06 +01:00
Michael Litvak
df801d16da service/tablet_allocator: always allow tablet merge of tables with views
allow tablet merge of tables with views even if the
rf_rack_valid_keyspaces option is not set, because now keyspaces that
have views are enforced to always be rf-rack-valid, regardless of the
option value.
2025-12-22 09:14:30 +01:00
Michael Litvak
07d85af433 locator: extend rf-rack validation for rack lists
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.

If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
2025-12-22 09:14:30 +01:00
Michael Litvak
d40d06c7ad test: test rf-rack validity when creating keyspace during node ops
add tests that attempt to create a keyspace during different stages of
node join or remove, and verify that the rf-rack condition can't be
broken - either creating the keyspace should fail or the node operation
should fail, depending on the stage.
2025-12-22 09:14:30 +01:00
Michael Litvak
a738905a4b locator: fix rf-rack validation during node join/remove
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
   there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
   node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks

To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.

Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
  considered. We can create a keyspace with rf=3 as if the node doesn't
  exist, and then node join will fail in the group0 operation that
  assigns bootstrap tokens, because during this operation we check
  rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
  serialized with keyspace creation. During this operation we check that
  adding the node as a token owner will maintain rf-rack validity for all
  keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
  normal token owner: it is considered as a transitioning token owner by
  the rf-rack check and the rack is considered a transitioning rack. We
  can't count the rack as a normal rack because the node join may still
  fail and rollback. Trying to create a keyspace with either rf=3 or
  rf=4 will fail because we can end up with either 3 or 4 racks.

Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
2025-12-22 09:14:30 +01:00
Michael Litvak
9940dcefa7 test: test topology restrictions for views with tablets
Add tests that verify the restrictions on topology operations when there
are keyspaces with tablets and materialized views.

For such keyspaces, RF=Racks must be enforced while they have
materialized views, therefore adding a node in a new rack or removing a
node that would eliminate a rack should be rejected.
2025-12-22 09:14:30 +01:00
Michael Litvak
870aad7f71 test: add test_topology_ops_with_rf_rack_valid
add new tests for testing that RF-rack validity is maintained when doing
topology operations that may break them, such as adding nodes in new
racks or removing nodes.
2025-12-22 09:14:30 +01:00
Michael Litvak
8f1a566be8 topology coordinator: restrict node join/remove to preserve RF-rack validity
when a new node joins or an existing node is removed / decommissioned,
check if the operation would violate the RF-rack-validity of some
keyspace. if so - reject the operation in order to preserve
RF-rack-validity.

Fixes scylladb/scylladb#23345
Fixes scylladb/scylladb#26820
2025-12-22 09:14:30 +01:00
Michael Litvak
aa8db3b8da topology coordinator: add validation to node remove
add validation to node remove / decommission, similar to node validation
when a node joins.

when starting node remove or decommission, the validation function
checks if the operation is valid and can proceed. if not, it's aborted
with an error message.

we change the return type of validate_joining_node so that it will be
similar and consistent with the new validate_removing_node.
2025-12-22 09:14:29 +01:00
Michael Litvak
9e1f78d162 locator: extend rf-rack validation functions
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.

This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.

The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
2025-12-22 09:14:29 +01:00
Michael Litvak
8df61f6d99 view: change validate_view_keyspace to allow MVs if RF=Racks
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.

Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.

Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.

We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.

We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.

We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.

* Create MV/index: previously the test checked that it's allowed only if
  the config option `rf_rack_valid_keyspaces` is set. This is changed
  now so it's always allowed to create MV/index if the keyspace is
  RF-rack-valid. Update the test to verify that we can create MV/index
  when the keyspace is RF-rack-valid, even if the rf_rack option is not
  set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
  can't be altered to become RF-rack-invalid.
2025-12-22 09:14:29 +01:00
Michael Litvak
de1bb84fca db: enforce rf-rack-validity for keyspaces with views
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.

Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
2025-12-22 09:13:49 +01:00
Michael Litvak
75b269229d replica/db: add enforce_rf_rack_validity_for_keyspace helper
Add the helper function enforce_rf_rack_validity_for_keyspace that
returns true if RF-rack-validity should be enforced for a keyspace, and
use it wherever we need to check this instead of checking the config
option directly.

This is useful because this condition is used in multiple places, and
having it defined in a single helper function will make it easier to
see and change the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
8b0b0c4d80 db: remove enforce parameter from check_rf_rack_validity
simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. remove the parameter and use the
option value directly from the db config.

this will be useful for a later change to the enforcement conditions.
2025-12-22 09:13:49 +01:00
Michael Litvak
61ae653693 test: adjust test to not break rf-rack validity
the test test_unfinished_writes_during_shutdown starts 3 nodes in 3
racks and creates a keyspace with RF=3, then adds a new node in a 4th
rack. this breaks rf-rack validity for the keyspace.

we change it instead to add the new node in an existing rack. it doesn't
matter for the test - the test only wants to add a new node to trigger
some topology change.
2025-12-22 09:13:48 +01:00
Karol Nowacki
addac8b3f7 vector_search: test: Fix flaky DNS resolution test
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:

1. Race between initial refresh and manual trigger: The test assumes
    a specific resolution sequence, but timing variations between the
    initial DNS refresh (on client creation) and the first manual
    trigger (in the test loop) can cause unexpected delayed scheduling.

2. Extra triggers from resolve_hostname fiber: During the client
    refresh phase, the background DNS fiber clears the client list.
    If resolve_hostname executes in the window after clearing but
    before the update completes, pending triggers are processed,
    incrementing the resolution count unexpectedly. At count 6, the
    mock resolver returns a valid address (count % 3 == 0), causing
    the test to fail.

The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.

Fixes: #27074

Closes scylladb/scylladb#27685
2025-12-21 20:02:16 +02:00
Vlad Zolotarov
ea95cdaaec transport/server: declare a new "CLIENT_OPTIONS" option as supported
Declare support for a 'CLIENT_OPTIONS' startup key.
This key is meant to be used by drivers for sending client-specific
configurations like request timeouts values, retry policy configuration, etc.
The value of this key can be any string in general (according to the CQL binary protocol),
however, it's expected to be some structured format, e.g. JSON.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
28cbaef110 service/client_state and alternator/server: use cached values for driver_name and driver_version fields
Optimize memory usage changing types of driver_name and driver_version be
a reference to a cached value instead of an sstring.

These fields very often have the same value among different connections hence
it makes sense to cache these values and use references to them instead of duplicating
such strings in each connection state.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
85adf6bdb1 system.clients: add a client_options column
This new column is going to contain all OPTIONS sent in the
STARTUP frame of the corresponding CQL session.

The new column has a `frozen<map<text, text>>` type, and
we are also optimizing the amount of required memory for storing
corresponding keys and values by caching them on each shard level.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:15 -05:00
Vlad Zolotarov
3a54bab193 controller: update get_client_data to use foreign_ptr for client_data
get_client_data() is used to assemble `client_data` objects from each connection
on each CPU in the context of generation of the `system.clients` virtual table data.

After collected, `client_data` objects were std::moved and arranged into a
different structure to match the table's sorting requirements.

This didn't allow having not-cross-shard-movable objects as fields in the `client_data`,
e.g. lw_shared_ptr objects.

Since we are planning to add such fields to `client_data` in following patches this patch
is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>`
objects instead of naked `client_data` ones.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-19 11:01:41 -05:00
Anna Stuchlik
f65db4e8eb doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756
2025-12-19 12:53:40 +01:00
Botond Dénes
df2ac0f257 Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic
This PR migrates schema management tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following:
- `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start

Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests.

Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the **probably** part of the flush analysis step described above. If the issue reoccurs, we will have more information.

The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated.

Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved.
Small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize list in `stress_object` for loop. Use a generator expression.

The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)

Fixes #26932

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#27106

* github.com:scylladb/scylladb:
  test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
  test: dtest: schema_management_test.py: fix large partition add column test
  test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
  test: dtest: schema_management_test.py: test enhancements
  test: dtest: schema_management_test.py: make the tests work
  test: dtest: migrate setup and tools from dtest
  test: dtest: copy unmodified schema_management_test.py
  replica: database: flush_all_tables log on completion
2025-12-19 12:30:00 +02:00
Botond Dénes
093e97a539 Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: https://github.com/scylladb/scylladb/issues/27715

No backport, fix for test on master.

Closes scylladb/scylladb#27735

* github.com:scylladb/scylladb:
  test: remove unused `get_processed_tasks_for_group`
  test: increase num of requests in driver_service_level tests
2025-12-19 10:54:14 +02:00
Emil Maskovsky
fa6e5d0754 test/random_failures: fix handling of banned notification
After 39cec4a node join may fail with either "init - Startup failed"
notification or occasionally because it was banned, depending on timing.

The change updates the test to handle both cases.

Fixes: scylladb/scylladb#27697

No backport: This failure is only present in master.

Closes scylladb/scylladb#27768
2025-12-19 09:55:31 +02:00
Emil Maskovsky
08518b2c12 test/raft: fix test_joining_old_node_fails flakiness
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.

This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).

Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.

Fixes: scylladb/scylladb#27603

No backport: This failure is only present in master.

Closes scylladb/scylladb#27760
2025-12-19 09:44:09 +02:00
Emil Maskovsky
2a75b1374e test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737
2025-12-19 09:42:19 +02:00
Łukasz Paszkowski
2cb9bb8f3a test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling the speculative retries.

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488
2025-12-19 09:39:09 +02:00
Botond Dénes
c899c117c7 scylla-gdb.py: scylla read-stats: include all permit lists
The current code which collects permit stats is out-of-date (by a few
years), as it only iterates through _permit_list. There are 4 additional
lists that permits can be part of now (all intrusive). Include all of
these in the stat collection.
As a bonus, also print the semaphore pointer in the printout, so the
user can hand-examine it, should they wish to.
2025-12-18 18:26:07 +02:00
Botond Dénes
db9f9e1b7e scylla-gdb.py: scylla fiber: add --direction command-line param
Can be "forward", "backward" or "both" (default).
Allows traversing the fiber in just one direction. Useful when scylla
fiber fails to traverse through a task and the user has to locate the
next one in the chain manually. When resuming from this next item, the
user might want to skip the already seen part of the fiber, to save time
on the invokation.
2025-12-18 18:26:07 +02:00
Botond Dénes
4c6e508811 scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
Traversing through coroutines forward (finding task waiting on this
coroutine) is already supported. This patch adds support for traversing
through coroutines backwards (finding task waited on by coroutine).
Coroutines need special handling: the future<> object is most likely
allocated on the coroutine frame, so we have to search throgh that to
find it. When doing so the first two pointers on the frame have to be
skipped: these are pointers to .resume and .destroy respectively and
will halt the search algorithm if seen.
2025-12-18 18:26:06 +02:00
Dario Mirovic
f1d63d014c test: dtest: schema_management_test.py: speed up TestLargePartitionAlterSchema tests
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster. ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional programming style expressions for `new_versions` and `columns_list` with
  comprehension/generator statement python style code, improving readability

Refs #26932

fix
2025-12-18 17:07:27 +01:00
Michael Litvak
33f7bc28da docs: document restrictions of colocated tables
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.

Extend the documentation to explain what colocated tables are and
document these restrictions.

Fixes scylladb/scylladb#27261

Closes scylladb/scylladb#27516
2025-12-18 15:38:29 +01:00
Cezar Moise
0ef8ca4c57 test: add logging to crash_coordinator_before_stream injection
In order to have the test ignore crashes caused by the injection,
it needs to log its occurence.
2025-12-18 16:28:13 +02:00
Cezar Moise
95d0782f89 test: add crash detection during tests
After tests end, an extra check if performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
2025-12-18 16:28:13 +02:00
Dario Mirovic
f831ca5ab5 test: dtest: schema_management_test.py: fix large partition add column test
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.

Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.

This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
1fe0509a9b test: dtest: schema_management_test.py: add TestSchemaManagement.prepare
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
e7d76fd8f3 test: dtest: schema_management_test.py: test enhancements
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.

Add descriptions to the stress functions.

Do not materialize list in `stress_object` for loop. Use a generator expression.

Make `_set_stress_val` an object method.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
700853740d test: dtest: schema_management_test.py: make the tests work
Remove unused function markers.
Add wait_other_notice=True to cluster start method in
TestSchemaHistory.prepare function to make the test stable.

Enable the test in suite.yaml for dev and debug modes.

Fixes #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
3c5dd5e5ae test: dtest: migrate setup and tools from dtest
Migrate several functionalities from dtest. These will be used by
the schema_management_test.py tests when they are enabled.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
5971b2ad97 test: dtest: copy unmodified schema_management_test.py
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.

Add license header.

Disable it for debug, dev, and release mode.

Refs #26932
2025-12-18 12:54:42 +01:00
Dario Mirovic
f89315d02f replica: database: flush_all_tables log on completion
In database::flush_all_tables add log on completion.
This slightly improves the readability of logs when debugging an issue.

Refs #26932
2025-12-18 12:54:42 +01:00
Patryk Jędrzejczak
d5c205194b Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.

The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.

Fixes:  scylladb/scylladb#25530

No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.

Closes scylladb/scylladb#26931

* https://github.com/scylladb/scylladb:
  topology: make removenode use left_token_ring state for global barrier
  topology: allow removing nodes not having tokens
  features: add feature flag for removenode via left token ring
2025-12-18 09:34:38 +01:00
Andrzej Jackowski
6ad10b141a test: remove unused get_processed_tasks_for_group
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
2025-12-17 20:45:53 +01:00
Andrzej Jackowski
8cf8e6c87d test: increase num of requests in driver_service_level tests
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: scylladb/scylladb#27715
2025-12-17 20:45:48 +01:00
Michael Litvak
3a06c32749 schema_registry: fix learning a schema with cdc schema
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.

This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.

We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.

Fixes scylladb/scylladb#27610

Closes scylladb/scylladb#27704
2025-12-17 20:01:00 +02:00
Michał Jadwiszczak
74ab5addd3 test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.

However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.

This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.

Fixes scylladb/scylladb#26683

Closes scylladb/scylladb#27389
2025-12-17 17:29:15 +01:00
Michael Litvak
55f4a2b754 migration_listener: fix deadlock in nested notifications
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
   acquires a read lock on the vector of listeners, and calls the
   callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
   write lock on the vector of listeners. it waits because the lock is
   held.
A: the callback function calls another notification and calls
   thread_for_each which tries to acquire the read lock again. but it
   waits since there is a waiter.

Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.

Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.

Fixes scylladb/scylladb#27364

Closes scylladb/scylladb#27637
2025-12-17 14:00:28 +01:00
Emil Maskovsky
1642c686c2 topology: make removenode use left_token_ring state for global barrier
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring`
  state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
  shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.

Fixes:  scylladb/scylladb#25530
2025-12-17 13:31:11 +01:00
Emil Maskovsky
9431826c52 topology: allow removing nodes not having tokens
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).

This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).

Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
2025-12-17 13:31:11 +01:00
Emil Maskovsky
ba6fabfc88 features: add feature flag for removenode via left token ring
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.

To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
2025-12-17 13:31:11 +01:00
Avi Kivity
cfd91545f9 test: add proxy protocol tests
Test that the new configuration options work and that we can
connect to them. Use direct connections with an inline implementation
of the proxy protocol and the CQL native protocol, since we want
to maintain direct control over the source port number (for shard-aware
ports). Also test we land on the expected shard.
2025-12-17 14:18:04 +02:00
Avi Kivity
1382b47d45 config, transport: support proxy protocol v2 enhanced connections
We have four native transport ports: two for plain/TLS, and two
more for shard-aware (plain/TLS as well). Add four more that expect
the proxy protocol v2 header. This allows nodes behind a reverse
proxy to record the correct source address and port in system.clients,
and the shard-aware port to see the correct source port selection
made my the client.
2025-12-17 14:18:04 +02:00
Pavel Emelyanov
a6618f225c object_storage_endpoint_param: Make it formattable for real
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.

Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.

Fixes #27718

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27687
2025-12-17 11:48:39 +01:00
Dani Tweig
0bfd07a268 github: synching scylladb.git new milestones with Jira releases
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.

Fixes: PM-100

Closes scylladb/scylladb#27717
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Anna Stuchlik
7061384a27 doc: replace "Scylla" with "ScyllaDB" in the Alternator docs
Fixes https://github.com/scylladb/scylladb/issues/27708

Closes scylladb/scylladb#27709
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
7bc59e93b2 Fix lambda-coroutine fiasco in hint_endpoint_manager.cc
Found by copilot.

No issue was observed yet.

Fixes #27520

Closes scylladb/scylladb#27477
2025-12-16 20:16:41 +03:00
Łukasz Paszkowski
a61c221902 test/pylib/suite/python.py: Handle extra_cmdline_options correctly
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.

This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`

Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.

Closes scylladb/scylladb#27523
2025-12-16 20:14:43 +03:00
Benny Halevy
798714183e table: get_snapshot_details: add FIXME comments
Ref https://github.com/scylladb/seastar/pull/3163

We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:45:56 +02:00
Benny Halevy
f5ca3657e2 table: get_snapshot_details: lookup entries also in the staging directory
Since the sstables in the snapshot may still be in the staging dir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
dc00461adf table: get_snapshot_details: optimize using the entry number_of_links
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
be6d87648c table: get_snapshot_details: continue loop for manifest and schema entries
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
004c08f525 table: get_snapshot_details: use directory_lister
It is more efficient to use the coroutine generator
to list the directory.

Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
David Garcia
386ec0af4e docs: fix multiversion build
Installs sphinx-scylladb-markdown = "^0.1.4" to fix the multiversion build.

Details in https://github.com/scylladb/sphinx-scylladb-theme/pull/1552

fix: update poetry

Closes scylladb/scylladb#27619
2025-12-16 19:41:03 +03:00
Pavel Emelyanov
c4496dd63c Merge 'test/cqlpy: rename tests with duplicate name' from Nadav Har'El
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.

Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.

Closes scylladb/scylladb#27644

* github.com:scylladb/scylladb:
  test/cqlpy: rename test with duplicate name
  test/cqlpy: rename test with duplicate name
2025-12-16 19:32:20 +03:00
Nadav Har'El
84df5cfaf8 test/alternator: delete unnecessary "pass"
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27645
2025-12-16 19:29:23 +03:00
Botond Dénes
f06db096bd test/boost/reader_concurrency_semaphore_test: un-flake memory limit engages test
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).

Fixes: #27248

Closes scylladb/scylladb#27650
2025-12-16 19:24:49 +03:00
Pavel Emelyanov
31f90c089c Merge 'test/alternator: remove unused variable assignments and statements' from Nadav Har'El
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).

These patches don't change anything in the functionality of any of the tests. It's all cosmetic.

Closes scylladb/scylladb#27655

* github.com:scylladb/scylladb:
  test/alternator: remove unnecessary duplicate statement
  test/alternator: remove unused variable assignments
2025-12-16 19:23:34 +03:00
Nadav Har'El
c58739de6a Fix for Unused local variable
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27665
2025-12-16 19:20:53 +03:00
Benny Halevy
9e18cfbe17 sstable: add _mutate_sem to serialize link/move with components rewrite
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.

Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.

Fixes #25919

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 17:06:45 +02:00
Piotr Dulikowski
7900aa5319 Merge 'server: fix scheduling group update timing in system.clients' from Alex Dathskovsky
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.

fixes: #26060
fixes: #27295

backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table

Closes scylladb/scylladb#26404

* github.com:scylladb/scylladb:
  test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
  server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
  server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
  server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
  server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
2025-12-16 15:39:49 +01:00
Aleksandra Martyniuk
9d20f0a3d2 test: add est_rack_list_conversion_with_two_replicas_in_rack 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
0476e8d272 test: test creating tablet_rack_list_colocation_plan 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
e48789cf6c test: add test_numeric_rf_to_rack_list_conversion test 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
9039dfa4a5 tasks: service: add global_topology_request_virtual_task
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.

Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.

The progress of the rf change isn't counted.
2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk
1884e655d6 cql3: statements: allow altering from numeric rf to rack list
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
640c491388 service: topology_coordinator: pause keyspace_rf_change request
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.

The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
cd83d1d4dc service: implement make_rack_list_colocation_plan
The make_rack_list_colocation_plan consists of two phases.

In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.

In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.

find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
2025-12-16 13:29:05 +01:00
Aleksandra Martyniuk
bbe0b01b14 service: add tablet_rack_list_colocation_plan
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.

Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).

Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.

make_rack_list_colocation_plan will be implemented in the following change.
2025-12-16 13:27:50 +01:00
Aleksandra Martyniuk
2e7ba1f8ce cql3: reject concurrent alter of the same keyspace
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.

This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
2025-12-16 13:27:48 +01:00
Aleksandra Martyniuk
b3a0e4c2dc test: check paused rf change requests persistence 2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
08e5f35527 db: service: add paused_rf_change_requests to system.topology
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.

We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.

In this commit no data is kept in the new column.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
d66a36058b service: pass topology and system_keyspace to load_balancer ctor
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
6681c0f33f service: tablet_allocator: extract load updates
Extract consider_scheduled_load and consider_planned_load so that
they can be reused.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
13e9ee3f6f service: tablet_allocator: extract ensure_node
Extract ensure_node method so that it can be reused.
2025-12-16 13:25:38 +01:00
Tomasz Grabiec
71e6ef90f4 tasks, system_keyspace: Introduce get_topology_request_entry_opt()
It's a cleanup. Better to return std::nullopt than faking an entry
with an id when require_entry == false.
2025-12-16 13:25:34 +01:00
Tomasz Grabiec
902803babd node_ops: Drop get_pending_ids()
Every pending request should also have an entry in
system.topology_requests so it's redundant.

And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
2025-12-16 13:25:11 +01:00
Dawid Mędrek
df0830044d cql3: Extend DESC INDEX by view properties
We're extending the logic of DESCRIBE INDEX to include properties of the
underlying materialized view. Tests are provided to ensure the
implementation works as intended.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
d0a42852c0 cql3: Forbid using CLUSTERING ORDER BY when creating index
This is a temporary solution as handling this property may require
a bit more attention or at least a bit more focus. For now, let's
forbid using it so it's clear it won't get applied. A simple test
is provided to cover it.

We document the restriction.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
6541362a0b cql3: Extend CREATE INDEX by MV properties
After the previous patch that extended the grammar and provided
basic functionalities to accommodate properties of materialized views
in indexes, this commit takes another step and actually applies them
to the underlying view when it's being created.

We're providing validation tests for each property, with the single
exception of CLUSTERING ORDER BY. That one will be handled separately
in an upcoming commit.

We also update the user documentation.
2025-12-16 11:43:38 +01:00
Dawid Mędrek
e1f1071bc5 cql3/statements/create_index_statement: Allow for view options
We're allowing CREATE INDEX to accept the same set of properties as
materialized views do. Our goal is to give the user an ability to
configure the underlying materialized view of an index directly,
when creating it.

This commit doesn't do anything except for extending the grammar
and passing the right pieces of information to the right destinations.
There's no validation and the options have no effect yet. That will
be done in the following patch.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
2cc01eeffb cql3/statements/create_index_statement: Rename member 2025-12-16 11:43:37 +01:00
Dawid Mędrek
b3099e2f2c cql3/statements/index_prop_defs: Re-introduce index_prop_defs
The type represents a mix of both index-specific and view properties.
Since we cannot easily distinguish which properties belong to which
entity, let's use this abstraction and filter them from the C++ level.

This is a prerequisite for extending the capabilities of CREATE INDEX
by allowing it to configuring the underlying materialized view.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
6dfebc0b6e cql3/statements/property_definitions: Add extract_property()
The method will be useful in an upcoming commit where we'll want to
split a type inherited from `property_definitions` into two separate
entities.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
62a8dd1b7f cql3/statements/index_prop_defs.cc: Add namespace 2025-12-16 11:43:37 +01:00
Dawid Mędrek
dcf2c71204 cql3/statements/index_prop_defs.hh: Rename type
We rename the type `index_prop_defs` to `index_specific_prop_defs`.
The rationale for the change is to distinguish between properties
related directly to a index and properties related to the underlying
view (if applicable).

The type `index_prop_defs` will be re-introduced in an upcomming commit
where it'll encompass both index-related and view-related properties.
This is a prerequisite for it.
2025-12-16 11:43:37 +01:00
Dawid Mędrek
e9108677f7 cql3/statements/view_prop_defs.cc: Move validation logic into file
We're moving the rest of the validation logic that can be moved from
`cql3/statements/{create,alter}_view_statement.cc` to the new file.
2025-12-16 11:43:35 +01:00
Dawid Mędrek
f5f7aeaa0a cql3/statements: Introduce view_prop_defs.{hh,cc}
We're introducing a new type wrapping properties that can be used with
materialized views. Doing that, we achieve the following things:

(1) We can keep validation logic in one place.
(2) We differentiate between properties of a regular table and
    properties of a materialized view.
(3) It provides better modularization and allows for reusing the code.
(4) It gets rid of inconsistencies in the existing code, e.g.
    CREATE MV using one type for properties, while ALTER MV another.

The actual end goal of this commit is to be able to reuse at least part
of the validation logic of MVs in CREATE INDEX and, when it gets added,
ALTER INDEX: we want to endow those statements with an ability to modify
the underlying materialized view without having to modify it directly.

This patch does NOT implement the whole validation logic yet. It will be
done in a following commit.

Refs scylladb/scylladb#16454
2025-12-16 11:43:03 +01:00
Tomasz Grabiec
4ed17c9e88 node_ops: Drop redundant get_status_helper() 2025-12-16 11:09:11 +01:00
Patryk Jędrzejczak
73db5c94de Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.

This patch series contains:
 - Introduction of `CLIENT_ROUTES` feature flag.
 - Implementation of raft-based `system.client_routes` table
 - Implementation of `v2/client-routes` POST/DELETE/GET endpoints
 - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
 - New tests that verifies the aforementioned features

Ref: scylladb/scylla-enterprise#5699

For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.

Closes scylladb/scylladb#27323

* https://github.com/scylladb/scylladb:
  docs: describe CLIENT_ROUTES_CHANGE extension
  test: add test for CLIENT_ROUTES event
  service: transport: add CLIENT_ROUTES_CHANGE event
  test: add cluster tests for client routes
  test: add API tests for client_routes endpoints
  test: add `timeout` parameter to `delete` in RESTClient
  test: allow json_body in send
  api: implement client_routes endpoints
  api: add client_routes.json
  service: main: add client_routes_service
  db: add system.client_routes table
  gms: add CLIENT_ROUTES feature
2025-12-16 10:38:27 +01:00
Botond Dénes
85f05fbe1b Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk"
This reverts commit 866c96f536, reversing
changes made to 367633270a.

This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.

Fixes: #27496

Closes scylladb/scylladb#27518
2025-12-16 11:34:40 +02:00
Botond Dénes
83f46fa7f5 doc: add video link for TTL
Closes: #26210
2025-12-16 10:43:03 +02:00
Anna Stuchlik
ea6f2a21c6 doc: remove references to ScyllaDB versions 4.3 and 4.4
We should never refer to the no longer supported OSS versions.
This is a leftover - other mentions were removed long time ago.

Fixes https://github.com/scylladb/scylladb/issues/19569

Closes scylladb/scylladb#27656
2025-12-16 06:58:53 +02:00
Yaniv Kaul
30c4bc3f96 Fix for __iter__ method returns a non-iterator
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.

Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27642
2025-12-16 06:57:19 +02:00
Piotr Smaron
77fa936edc doc: audit: update to present how to enable both syslog and table
Supporting both sinks have been introduced in
https://github.com/scylladb/scylladb/pull/26613, but it missed the docs
changes, so here they are.

Closes scylladb/scylladb#27607
2025-12-16 06:56:39 +02:00
Piotr Smaron
0ec485845b Clarify documentation build instructions
Closes scylladb/scylladb#27606
2025-12-16 06:56:00 +02:00
Botond Dénes
dace39fd6c Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund
Fixes #26744

If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.

However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.

Closes scylladb/scylladb#27556

* github.com:scylladb/scylladb:
  test::cluster::dtest::tools::files: Remove file
  commitlog_replay: Handle fully corrupt files same as partial corruption.
  test::pylib::suite::base: Split options.name test specifier only once
2025-12-16 06:55:42 +02:00
Calle Wilund
5f8f724d78 repair: Don't use off-strategy as repair destination with tablet tables
Fixes #17384

Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.

v2:
Bypass off-strat for whatever reason iff dest is tablets.

Closes scylladb/scylladb#27500
2025-12-16 06:54:07 +02:00
Michał Chojnowski
df93ea626b test/scylla_gdb: use gcore instead of signal SIGSEGV to generate a coredump on failure
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).

We can use gdb's `gcore` command to produce a coredump in a more
predictable way.

Refs scylladb/scylladb#22501

Closes scylladb/scylladb#27498
2025-12-16 06:53:43 +02:00
Botond Dénes
74347625f9 Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El
This series adds an xfailing reproducers for two issue: #8070 and #27037:

27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.

8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.

The first two patches are two small cleanups (including fixes #27372)  that I did while preparing the real tests - which are in the final two patches.

Closes scylladb/scylladb#27376

* github.com:scylladb/scylladb:
  test/alternator: add reproducer for bug with storing invalid values
  test/alternator: reproducer for issue 27375
  utils/rjson: fix error messages from rjson::parse()
  test/alternator: extract get_signed_request() to util.py
2025-12-16 06:53:14 +02:00
Andrzej Jackowski
f1fc5cc808 docs: describe CLIENT_ROUTES_CHANGE extension
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
61bbea51ad test: add test for CLIENT_ROUTES event
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
c2b1b10ca0 service: transport: add CLIENT_ROUTES_CHANGE event
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes and drivers need to adapt promptly
to avoid connectivity issues.

This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.

Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
ec87b92ba1 test: add cluster tests for client routes
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:17:15 +01:00
Andrzej Jackowski
9c9371511f test: add API tests for client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:46:14 +01:00
Andrzej Jackowski
2e80997630 test: add timeout parameter to delete in RESTClient
The parameter was missing and is needed to implement a test case
later in this patch series.
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
1143acaf5b test: allow json_body in send
Needed to test `/v2/connection_metadata` endpoints that receive
JSON input.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
e153cc434f api: implement client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:36:47 +01:00
Nadav Har'El
4e106b9820 test/cqlpy: remove unused variables
Copilot detected a few cases of cqlpy tests setting a variable which
they don't use. In all the cases in this patch, we can just remove
the variable. Although the AI found all these unused variables, I
verified each case carefully before changing it in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:11:04 +02:00
Nadav Har'El
64d9c370ee test/alternator: remove unnecessary duplicate statement
copilot noticed that test/alternator/test_scan.py had a duplicate
statement (call to full_scan()). It doesn't break the test, but also
adds nothing but confusion - so let's just remove it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:45 +02:00
Nadav Har'El
a3959fe3db test/alternator: remove unused variable assignments
copilot noticed in that in in many of Alternator tests, we have some
unnecessary assignments. For example, in a few places, we use the idiom:

     with pytest.raises(...):
         ret = ...

The "ret=" part is unnecessary, as this test expects the statement to
fail (hence the raises()), and ret is never assigned. The assignment
was only there because we copied this statement from another place in
the test, which does expect the statement to pass and wants to validate
the returned value.

So we should just drop the "ret=" from these tests.

Another common occurance is that we used the idiom

     response = table.do_something()

Without checking the response and no intention to check it (either we
know it will work, or we just want to check it doesn't throw). So we
can drop the "response=" here too.

All of the unused variables in this patch were discovered by Copilot,
but I reviewed each of them carefully myself and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:05 +02:00
Nadav Har'El
4fa4f40712 test/cqlpy: use unique partition in test
It is traditional to use a unique (or random) partition key in cqlpy
tests, to allow multiple tests to share the same table and make the test
suite a bit faster. One of the tests, test_multi_column_relation_desc,
set up a unique key "k", but then forgot to use it and used partition
key 0 instead. Fix the test to use this k.

This problem was spotted by Copilot, who saw the unused variable k.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 17:08:51 +02:00
Yauheni Khatsianevich
07867a9a0d test: new LWT with counters test during tablets migration/resize
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency
2025-12-15 14:32:30 +01:00
Patryk Jędrzejczak
844545bb74 Merge 'treewide: fix cases of improper re-throwing of std::exception_ptr' from Emil Maskovsky
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.

Resolved at various places found by clang-tidy:

1. db::schema_applier

When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.

However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.

2. directories

The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).

3. storage_service

Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501

No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.

Closes scylladb/scylladb#27613

* https://github.com/scylladb/scylladb:
  db: schema_applier: improve exception-safe cleanup
  directories: fix exception rethrowing
  storage_service: use coroutine-friendly exception propagation in join_node_response_handler
2025-12-15 13:56:45 +01:00
Nadav Har'El
ccacea621f test/cqlpy: fix flaky test test_view_in_system_tables
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.

For historic reasons  this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.

But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.

The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.

Fixes #27296

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27626
2025-12-15 15:29:08 +03:00
Nadav Har'El
f287484f4d test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/CreateTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 16 tests in this file, instead of 15
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 14:19:24 +02:00
Benny Halevy
f4a4671ad6 table: seal_snapshot: avoid oversized allocation when dumping manifest.json
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.

This is not scalable and may cause large allocation for large enough number
of files.

Fixes #24216

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27542
2025-12-15 15:19:24 +03:00
Dawid Mędrek
8691d287be cql3/statements/create_view_statement.cc: Move validation of ID
The end goal we have in mind in this commit is to extract the validation
logic of the options used for creating and altering an MV to a separate
place and be able to call from different places in the code.

It will be useful when extending the capabilities of the CREATE INDEX
statement.

In this patch, we move the part of validation responsible for checking
the ID option to keep it close to the other parts of validation of the
options in their "raw" form.
2025-12-15 13:18:48 +01:00
Dawid Mędrek
11c109c623 schema/schema.hh: Do not include index_prop_defs.hh
One of the upcoming commits will lead to a cyclic dependency
of headers because `schema.hh` includes `index_prop_defs.hh`.
To prevent that, we remove the include and replace it with
a manually added alias.

This is not a perfect solution, but doing it properly would
require comprehensive changes. We can do that in a separate
task.
2025-12-15 13:18:48 +01:00
Andrzej Jackowski
70a0418102 api: add client_routes.json
Add the JSON definitions for the POST, GET, and DELETE endpoints used
to modify client routes. These endpoints are intended for Cloud to
update the `system.client_routes` table.

The API is implemented in `/v2/` because the endpoints process arrays
of objects. Handling of such structures was improved between
Swagger 1.2 and 2.0 versions. There are already similar
`get_metrics_config` and `set_metrics_config` endpoints that operate
on similar structures and they are also in /v2/.

The introduced JSON files start with `, ` but it's intended because
the files are concatenated to the existing (metrics) JSON files,
and they need to represent valid JSON after the concatenation.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:46 +01:00
Andrzej Jackowski
6fcc1ecf94 service: main: add client_routes_service
Introduce `client_routes_service` for managing
`system.client_routes` table.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:40 +01:00
Andrzej Jackowski
8dde70d04c db: add system.client_routes table
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.

The table is Raft-managed to provide consistent replication across
nodes.

Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:05 +01:00
Andrzej Jackowski
2e7070d3b7 gms: add CLIENT_ROUTES feature
The feature will be used later in this patch series:
 - To avoid unnecessary operations when the feature is not enabled
 - To guard new API endpoints from being used before the cluster is
   ready to use them.
 - To implement update tests (by disabling/enabling the feature)

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:04 +01:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Nadav Har'El
a9442e6d56 test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/DeleteTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 52 tests in this file, instead of 51
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 12:02:59 +02:00
Dawid Mędrek
1e14c08eee locator/token_metadata: Remove get_host_id()
The function is declared, but it's not defined or used anywhere.

Closes scylladb/scylladb#27374
2025-12-15 10:36:52 +01:00
Michael Litvak
b9ec1180f5 alternator: require rf_rack_valid_keyspaces when creating index
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.

The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.

Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.

Fixes scylladb/scylladb#27612

Closes scylladb/scylladb#27622
2025-12-15 10:36:57 +02:00
Michał Hudobski
12483d8c3c vector_search: throw an error when we restrict primary in vector search
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.

Fixes: VECTOR-331

Closes scylladb/scylladb#27143
2025-12-15 09:45:56 +02:00
Jenkins Promoter
d5641398f5 Update pgo profiles - aarch64 2025-12-15 05:16:31 +02:00
Alex
d21faab9dc test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft.
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
2025-12-14 18:46:06 +02:00
Alex
30f6a40ae6 server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant.
This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP).
The lookup function is fully synchronized, non-coroutine, and returns immediately.
For that reason, it’s better to perform this lookup outside of the switch_tenant function.
2025-12-14 18:46:06 +02:00
Alex
5579489c4c server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability.
To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
2025-12-14 18:46:05 +02:00
Alex
17c9d640fe server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group.
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
2025-12-14 16:27:40 +02:00
Alex
f98af582a7 server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation.
The new functionality will allow usage of service level cache
2025-12-14 16:09:18 +02:00
Nadav Har'El
c06e63daed Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski
This patch series contains the following changes:
 - Incorporation of `crypt_sha512.c` from musl to out codebase
 - Conversion of `crypt_sha512.c` to C++ and coroutinization
 - Coroutinization of `auth::passwords::check`
 - Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
   computing SHA 512 passwords of length <=255
 - Addition of yielding in the aforementioned hashing implementation.

The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711

No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.

Closes scylladb/scylladb#26860

* github.com:scylladb/scylladb:
  test/boost: add too_long_password to auth_passwords_test
  test/boost: add same_hashes_as_crypt_r to auth_passwords_test
  auth: utils: add yielding to crypt_sha512
  auth: change return type of passwords::check to future
  auth: remove code duplication in verify_scheme
  test/boost: coroutinize auth_passwords_test
  utils: coroutinize crypt_sha512
  utils: make crypt_sha512.cc to compile
  utils: license: import crypt_sha512.c from musl to the project
  Revert "auth: move passwords::check call to alien thread"
2025-12-14 14:01:01 +02:00
David Garcia
c1c3b2c5bb docs: fix local build
prevents early exits in metrics docs generation to break the local build.

Fixes #27497

Closes scylladb/scylladb#27615
2025-12-14 11:48:48 +02:00
Raphael S. Carvalho
a0a7941eb1 test: Add reproducer for split vs intra-node migration race
This is a problem caught after removing split from
add_sstable_and_update_cache(), which was used by
intra node migration when loading new sstables
into the destination shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
e3b9abdb30 test: Verify split failure on behalf of repair during critical disk utilization
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
bc772b791d test: boost: Add failure_when_adding_new_sstable_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
77a4f95eb8 test: Add reproducer for split vs incremental repair race condition
Refs #26041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:16 -03:00
Raphael S. Carvalho
992bfb9f63 compaction: Fail split of new sstable if manager is disabled
If manager has been disabled due to out of space prevention, it's
important to throw an exception rather than silently not
splitting the new sstable.

Not splitting a sstable when needed can cause correctness issue
when finalizing split later.
It's better to fail the writer (e.g. repair one) which will be
retried than making caller think everything succeeded.

The new replica::table::add_new_sstable_and_update_cache() will
now unlink the new sstable on failure, so the table dir will
not be left with sstables not loaded.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ee3a743dc4 replica: Don't split in do_add_sstable_and_update_cache()
Now, only sstable loader on boot and refresh from upload uses this
procedure. The idea is that maybe_split_new_sstable() will throw
when compaction cannot run due to e.g. out of space prevention.
It could fail repair writer, but we don't want it to fail boot.

As for refresh from upload, it's not supposed to work when tablet
map at the time of backup is not the same when restoring.
Even before this, refresh would fail if split already executed,
split would only happen if split was still ongoing. We need
token range stability for local restore. The safe variant will
always be load and stream.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
48d243f32f streaming: Leave sstables unsealed until attached to the table
We want the invariant that after ACK, all sealed sstables will be split.

This guarantee that on restart, no unsplit sstables will be found
sealed.

The paths that generate unsplit sstables are streaming and file
streaming consumers. It includes intra-node streaming, which
is local but can clone an unsplit sstable into destination.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
d9d58780e2 replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ddb27488fa replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
10225ee434 replica: Wire add_new_sstable_and_update_cache() into streaming consumer
After the wiring, failure to attach the new sstable in the streaming
consumer will unlink the sstable automatically.

Fixes #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
a72025bbf6 replica: Document old add_sstable_and_update_cache() variants
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
3f8363300a replica: Introduce add_new_sstables_and_update_cache()
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().

Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
63d1d6c39b replica: Introduce add_new_sstable_and_update_cache()
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.

This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.

This new procedure will be later wired into streaming consumers, in
order to fix those problems.

Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.

Refs #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
27d460758f replica: Account for sstables being added before ACKing split
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:

1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
794e03856a replica: Remove repair read lock from maybe_split_new_sstable()
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
2dae0a7380 compaction: Preserve state of input sstable in maybe_split_new_sstable()
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1fdc410e24 Rename maybe_split_sstable() to maybe_split_new_sstable()
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1a077a80f1 sstables: Allow storage::snapshot() to leave destination sstable unsealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c5e840e460 sstables: Add option to leave sstable unsealed in the stream sink
That will be needed for file streaming to leave output sstable unsealed.

we want the invariant where all sealed sstables are split after split
was ACKed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c10486a5e9 test: Verify unsealed sstable can be compacted
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
ab82428228 sstables: Allow unsealed sstable to be loaded
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
b1be4ba2fc sstables: Restore sstable_writer_config::leave_unsealed
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Emil Maskovsky
5e7456936e db: schema_applier: improve exception-safe cleanup
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.

However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.

This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
2025-12-12 18:18:31 +01:00
Emil Maskovsky
e6f5f2537e directories: fix exception rethrowing
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.

But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.

This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
2025-12-12 18:10:20 +01:00
Emil Maskovsky
76aacc00f2 storage_service: use coroutine-friendly exception propagation in join_node_response_handler
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
2025-12-12 17:59:54 +01:00
Cezar Moise
7c8ab3d3d3 test.py: add pid to ServerInfo
Adding pid info to servers allows matching coredumps with servers

Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the Object
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
2025-12-12 15:11:03 +02:00
Botond Dénes
cb7f2e4953 docs: scylla-sstable-script-api.rst: add introduction and title 2025-12-12 13:50:12 +02:00
Botond Dénes
dd5b6770c8 docs: scylla-sstable.rst: extract script API to separate document
The script API is 500+ lines long in an already too long and hard to
navigate document. Extract it to a separate document, making both
documents shorter and easier to navigate.
2025-12-12 13:44:32 +02:00
Botond Dénes
3d73a9781e docs: scylla-sstable: prepare for script API extract
We are about to extract the script API to a separate document. In
preparation convert soon-to-be cross-document references, so they keep
working after the extraction.
2025-12-12 13:15:48 +02:00
Piotr Smaron
982339e73f doc: audit: set audit as enabled by default 2025-12-12 09:18:54 +01:00
Piotr Smaron
d1a04b3913 Reapply "audit: enable some subset of auditing by default"
This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13.

Fixes: https://github.com/scylladb/scylladb/issues/26020
2025-12-12 09:18:54 +01:00
Nadav Har'El
b3b0860e7c test/alternator: add reproducer for bug with storing invalid values
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.

The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".

Refs #8070.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:58:22 +02:00
Nadav Har'El
db15c212a6 test/alternator: reproducer for issue 27375
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.

The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.

Refs #27375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:34:19 +02:00
Nadav Har'El
3595941020 utils/rjson: fix error messages from rjson::parse()
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:

  Parsing JSON failed: Missing a name for object member. at 1452254

These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.

The fix is trivial: add the missing initialization of the counter.

The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.

Fixes #27372

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:17:01 +02:00
Nadav Har'El
102516a787 test/alternator: extract get_signed_request() to util.py
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.

Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.

This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:16:42 +02:00
Calle Wilund
8c4ac457af test::cluster::dtest::tools::files: Remove file
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
2025-12-10 15:37:04 +01:00
Calle Wilund
e48170ca8e commitlog_replay: Handle fully corrupt files same as partial corruption.
Fixes #26744

If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.

Changed test to consistently cause this exact error instead.
2025-12-10 15:37:04 +01:00
Andrzej Jackowski
11ad32c85e test/boost: add too_long_password to auth_passwords_test
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.

Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.

Refs: scylladb/scylladb#26842
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4c8c9cd548 test/boost: add same_hashes_as_crypt_r to auth_passwords_test
The test verifies that the old and new implementation of SHA-512
hashing returns exactly the same values.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
98f431dd81 auth: utils: add yielding to crypt_sha512
This change allows yielding during hashing computations to prevent
stalls.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4ffdb0721f auth: change return type of passwords::check to future
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.

The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
775906d749 auth: remove code duplication in verify_scheme
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
11eca621b0 test/boost: coroutinize auth_passwords_test
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
d7818b56df utils: coroutinize crypt_sha512
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
033fed5734 utils: make crypt_sha512.cc to compile
The purpose of this change is to allow the usage of Seastar futures in
crypt_sha512 later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
c6c30b7d0a utils: license: import crypt_sha512.c from musl to the project
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
 - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
 - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT

Import commit:
  commit 1b76ff0767d01df72f692806ee5adee13c67ef88
  Author: Alex Rønne Petersen <alex@alexrp.com>
  Date:   Sun Oct 12 05:35:19 2025 +0200

  s390x: shuffle register usage in __tls_get_offset to avoid r0 as address

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
5afcec4a3d Revert "auth: move passwords::check call to alien thread"
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.

This reverts commit 9574513ec1.
2025-12-10 15:36:09 +01:00
Calle Wilund
9b5f3d12a3 test::pylib::suite::base: Split options.name test specifier only once
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).

Simply splitting only on first '::' fixes this. Should not affect any
other tests.
2025-12-10 15:35:12 +01:00
Pavel Emelyanov
a63508a6f7 object_storage: Temporarily handle pure endpoint addresses as endpoints
To keep backward compatibility, support

- old configs -- where endpoint is just an address and port is separate.
  When it happens, format the "new" endpoint name

- lookup by address-only. If it happens, scan all endpoints and see if
  any one matches the provided address

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
ca4564e41c code: Remove dangling mentions of s3::endpoint_config
Collect some code that's unused after previous changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
f93cafac51 docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
a3ca4fccef object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Tests, that generate the config files, and docs are updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
5953a89822 test: Add named constants for test_get_object_store_endpoints endpoint
names

Instead of hard-coded 'a' and 'b' here and there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
932b008107 s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
e47f0c6284 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Yauheni Khatsianevich
f12adfc292 test/lwt: add counter-table support to BaseLWTTester
Extend BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets
 with counters.
2025-12-08 15:57:33 +01:00
Michał Jadwiszczak
aa908ba99c docs/dev/view-building-coordinator: fix typos 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
529cd25c51 db/view/view_building_worker: remove unnnecessary empty lines 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
4fc5fcaec4 db/view/view_building_worker: fix typo 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
3253b05ec9 db/view/view_building_worker: avoid creating a copy of tasks map
The loop can be converted to use an iterator and avoid creating a copy
of the tasks map.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
597a2ce5f9 db/view/view_building_worker: wrap conditionally compiled code in a scope
The code creates a local variable, so it's better to wrap it in a local
scope, to the conditionally compiled variable doesn't pollute the
external scope.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
a5f19af050 db/view/view_building_worker: remove unnecessary CV broadcast
After scylladb/scylladb#26897 was merged, the worker doesn't use the
view building state machine CV to manage lifetime of batches, so the
broadcast is not needed.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
b4fe565f07 db/view/view_building_worker: catch general execption in staging task registrator
In case of general exception in `view_building_worker::create_staging_sstable_tasks()`,
catch it, print it with error level and sleep 1s before retrying.
This will allow for the registrator to retry its work in case of failure
and it should be easier to detect any bugs in the method.
2025-12-04 12:52:37 +01:00
Botond Dénes
e762027943 db/config: change batchlog_replay_cleanup_after_replays default to 1
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
2025-12-02 14:21:26 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
fb84b30f88 replica/mutation_dump: always set position weight for clustering positions
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.

This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
2025-12-02 14:21:26 +02:00
Botond Dénes
8545f7eedd service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
2025-12-02 14:21:26 +02:00
Botond Dénes
e52e1f842e test/lib: introduce error_injection.hh
Test-specific helpers for working with error injection. Provides an
RAII object to enable/disable error injection points, on all shards.
2025-12-02 14:21:26 +02:00
Botond Dénes
0a7df4b8ac utils/error_injection: add debug log to disable() and disable_all()
enable() and friends already has debug logs.
2025-12-02 14:21:26 +02:00
Botond Dénes
9bb8156f02 test/lib/cql_test_env: forward config to batchlog
Currently all batchlog config items are hardcoded. Make the two
important ones configurable: replay_timeout and
replay_cleanup_after_replays.
2025-12-02 14:21:26 +02:00
Botond Dénes
d1b796bc43 test/lib/cql_test_env: add batch type to execute_batch()
Allow executing logged batches too. Also add a trace log to the method.
2025-12-02 14:21:26 +02:00
Botond Dénes
1ad64731bc test/lib/cql_assertions: add with_size(predicate) overload
Sometimes, expected size is not a single number. Add predicate overload
to allow expressing more complicated expectations.
2025-12-02 14:21:26 +02:00
Botond Dénes
abadb8ebfb test/lib/cql_assertions: add source location to fail messages
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
2025-12-02 14:21:26 +02:00
Botond Dénes
54f16f9019 test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
Invoke the passed in function with a columns_assertions instance for the
current row, allowing for sweeping checks across columns of all rows.
2025-12-02 14:21:26 +02:00
Botond Dénes
b584e1e18e test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check 2025-12-02 14:21:26 +02:00
Botond Dénes
aa1d3f1170 test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
2025-12-02 14:21:26 +02:00
Botond Dénes
e309b5dbe1 db/batchlog_manager: config: s/write_timeout/reply_timeot/
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
ee851266be db/system_keyspace: introduce system.batchlog_v2
Rearranges the system.batchlog schema as follows:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
   range-tombstone. This allows dropping the individual dead batchlog
   entries, as they are shadowed by a higher level tombstone. This
   enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
   batchlog entries that fail the first replay attempt, are moved to the
   failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
   column.
4) Make batchlog entries ordered by the batchlog create time (id). This
   allows for selecting batchlogs to replay, without post-filtering of
   batchlogs that are too young to be replayed.
2025-12-02 14:21:25 +02:00
Botond Dénes
9434ec2fd1 service,db: extract generation of batchlog delete mutation
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
097c2cd676 db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
The propagation delay has no effect for other tombstone gc strategies,
so ignore it when tombstone-gc != repair.
2025-12-02 14:21:25 +02:00
Botond Dénes
4f30807f01 db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
2025-12-02 14:21:25 +02:00
Botond Dénes
55704908a0 data_dictionary: table: add get_truncation_time()
So the batchlog manager can avoid looking up the real table and instead
just work with data dictionary.
2025-12-02 14:21:25 +02:00
Botond Dénes
337f417b13 db/batchlog_manager: batch(): replace map_reduce() with simple loop
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
2025-12-02 12:05:10 +02:00
Botond Dénes
705af2bc16 db/batchlog_manager: finish coroutinizing replay_all_failed_batches
It was coroutinized already but strangely, some continuations also
remained. The `batch` lambda is still left in continuation style.
2025-12-02 10:42:28 +02:00
Botond Dénes
5b5f9120d0 db/batchlog_manager: improve replayAllFailedBatches logs
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
2025-12-02 10:42:28 +02:00
Calle Wilund
5f53d7852e docs::encryption: Add warning that replicated provider is deprecated
And will be removed.
2025-12-01 09:44:44 +00:00
Calle Wilund
52cc30e00c ent::encryption: Switch default key provider from replicated to local
Since we are deprecating the replicated provider, it makes little sense to
have it be default.
2025-12-01 09:44:44 +00:00
Calle Wilund
ba38d58539 replicated_key_provider: Add deprecation warning on usage
Warns user utilizing the provider that the provider is deprecated
and will be removed.
2025-12-01 09:44:44 +00:00
4457 changed files with 100556 additions and 43282 deletions

4
.github/CODEOWNERS vendored
View File

@@ -92,6 +92,10 @@ test/boost/querier_cache_test.cc @denesb
# PYTEST-BASED CQL TESTS
test/cqlpy/* @nyh
# TEST FRAMEWORK
test/pylib/* @xtrey
test.py @xtrey
# RAFT
raft/* @kbr-scylla @gleb-cloudius @kostja
test/raft/* @kbr-scylla @gleb-cloudius @kostja

View File

@@ -55,22 +55,26 @@ ninja build/<mode>/test/boost/<test_name>
ninja build/<mode>/scylla
# Run all tests in a file
./test.py --mode=<mode> <test_path>
./test.py --mode=<mode> test/<suite>/<test_name>.py
# Run a single test case from a file
./test.py --mode=<mode> <test_path>::<test_function_name>
./test.py --mode=<mode> test/<suite>/<test_name>.py::<test_function_name>
# Run all tests in a directory
./test.py --mode=<mode> test/<suite>/
# Examples
./test.py --mode=dev alternator/
./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/alternator/
./test.py --mode=dev test/cluster/test_raft_voters.py::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/cqlpy/test_json.py
# Optional flags
./test.py --mode=dev cluster/test_raft_no_quorum -v # Verbose output
./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5 # Repeat test 5 times
./test.py --mode=dev test/cluster/test_raft_no_quorum.py -v # Verbose output
./test.py --mode=dev test/cluster/test_raft_no_quorum.py --repeat 5 # Repeat test 5 times
```
**Important:**
- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)
- Use full path with `.py` extension (e.g., `test/cluster/test_raft_no_quorum.py`, not `cluster/test_raft_no_quorum`)
- To run a single test case, append `::<test_function_name>` to the file path
- Add `-v` for verbose output
- Add `--repeat <num>` to repeat a test multiple times
@@ -84,3 +88,14 @@ ninja build/<mode>/scylla
- Strive for simplicity and clarity, add complexity only when clearly justified
- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate
- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context
## Test Philosophy
- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.
- Stability matters. Tests should be stable. New tests should be executed 100 times at least to ensure they pass 100 out of 100 times. (use --repeat 100 --max-failures 1 when running it)
- Unit tests should ideally test one thing and one thing only.
- Tests for bug fixes should run before the fix - and show the failure and after the fix - and show they now pass.
- Tests for bug fixes should have in their comments which bug fixes (GitHub or JIRA issue) they test.
- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.
- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.
- Tests should consume as little resources as possible. Prefer running tests on a single node if it is sufficient, for example.

View File

@@ -1,6 +1,6 @@
version: 2
updates:
- package-ecosystem: "pip"
- package-ecosystem: "uv"
directory: "/docs"
schedule:
interval: "daily"

View File

@@ -4,7 +4,7 @@
# Copyright (C) 2024-present ScyllaDB
#
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
#
import argparse

View File

@@ -10,6 +10,9 @@ on:
types: [labeled, unlabeled]
branches: [master, next, enterprise]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-commit:
runs-on: ubuntu-latest
@@ -30,7 +33,7 @@ jobs:
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}

View File

@@ -5,12 +5,18 @@ on:
types: [opened, reopened, edited]
branches: [branch-*]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-fixes-prefix:
runs-on: ubuntu-latest
permissions:
contents: read
issues: write
steps:
- name: Check PR body for "Fixes" prefix patterns
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const body = context.payload.pull_request.body;
@@ -18,7 +24,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -12,6 +12,9 @@ on:
description: 'the md5sum for scylla executable'
value: ${{ jobs.build.outputs.md5sum }}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
@@ -24,7 +27,7 @@ jobs:
outputs:
md5sum: ${{ steps.checksum.outputs.md5sum }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: recursive
- name: Generate the building system

View File

@@ -0,0 +1,53 @@
name: Backport with Jira Integration
on:
push:
branches:
- master
- next-*.*
- branch-*.*
pull_request_target:
types: [labeled, closed]
branches:
- master
- next
- next-*.*
- branch-*.*
jobs:
backport-on-push:
if: github.event_name == 'push'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'push'
base_branch: ${{ github.ref }}
commits: ${{ github.event.before }}..${{ github.sha }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-on-label:
if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'labeled'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
head_commit: ${{ github.event.pull_request.base.sha }}
label_name: ${{ github.event.label.name }}
pr_state: ${{ github.event.pull_request.state }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-chain:
if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'chain'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
pr_body: ${{ github.event.pull_request.body }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status In Progress
on:
pull_request_target:
types: [opened]
jobs:
call-jira-status-in-progress:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_progress.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status In Review
on:
pull_request_target:
types: [ready_for_review, review_requested]
jobs:
call-jira-status-in-review:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_review.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,12 +0,0 @@
name: Call Jira Status Ready For Merge
on:
pull_request_target:
types: [labeled]
jobs:
call-jira-status-update:
uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_ready_for_merge.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

18
.github/workflows/call_jira_sync.yml vendored Normal file
View File

@@ -0,0 +1,18 @@
name: Sync Jira Based on PR Events
on:
pull_request_target:
types: [opened, edited, ready_for_review, review_requested, labeled, unlabeled, closed]
permissions:
contents: read
pull-requests: write
issues: write
jobs:
jira-sync:
uses: scylladb/github-automation/.github/workflows/main_pr_events_jira_sync.yml@main
with:
caller_action: ${{ github.event.action }}
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,22 @@
name: Sync Jira Based on PR Milestone Events
on:
pull_request_target:
types: [milestoned, demilestoned]
permissions:
contents: read
pull-requests: read
jobs:
jira-sync-milestone-set:
if: github.event.action == 'milestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_set.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-milestone-removed:
if: github.event.action == 'demilestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_removed.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,14 @@
name: Call Jira release creation for new milestone
on:
milestone:
types: [created, closed]
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,18 @@
name: validate_pr_author_email
on:
pull_request_target:
types:
- opened
- synchronize
- reopened
permissions:
contents: read
pull-requests: write
statuses: write
jobs:
validate_pr_author_email:
uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main

View File

@@ -7,8 +7,9 @@ on:
env:
HEADER_CHECK_LINES: 10
LICENSE: "LicenseRef-ScyllaDB-Source-Available-1.0"
LICENSE: "LicenseRef-ScyllaDB-Source-Available-1.1"
CHECKED_EXTENSIONS: ".cc .hh .py"
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-license-headers:
@@ -19,7 +20,7 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
@@ -40,7 +41,7 @@ jobs:
- name: Comment on PR if check fails
if: failure()
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const license = '${{ env.LICENSE }}';

View File

@@ -9,6 +9,7 @@ env:
# use the development branch explicitly
CLANG_VERSION: 21
BUILD_DIR: build
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions: {}
@@ -32,7 +33,7 @@ jobs:
steps:
- run: |
sudo dnf -y install git
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- name: Install build dependencies

View File

@@ -18,6 +18,7 @@ env:
BUILD_TYPE: RelWithDebInfo
BUILD_DIR: build
CLANG_TIDY_CHECKS: '-*,bugprone-use-after-move'
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions: {}
@@ -42,7 +43,7 @@ jobs:
IMAGE: ${{ needs.read-toolchain.image }}
run: |
echo ${{ needs.read-toolchain.image }}
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- run: |

View File

@@ -0,0 +1,65 @@
name: Close issues created by Scylla associates
on:
issues:
types: [opened, reopened]
permissions:
issues: write
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
comment-and-close:
runs-on: ubuntu-latest
steps:
- name: Comment and close if author email is scylladb.com
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const issue = context.payload.issue;
const actor = context.actor;
// Get user data (only public email is available)
const { data: user } = await github.rest.users.getByUsername({
username: actor,
});
const email = user.email || "";
console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
// Only continue if email exists and ends with @scylladb.com
if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
console.log("User is not a scylladb.com email (or email not public); skipping.");
return;
}
const owner = context.repo.owner;
const repo = context.repo.repo;
const issue_number = issue.number;
const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
// Add the comment
await github.rest.issues.createComment({
owner,
repo,
issue_number,
body,
});
console.log(`Comment added to #${issue_number}`);
// Close the issue
await github.rest.issues.update({
owner,
repo,
issue_number,
state: "closed",
state_reason: "not_planned"
});
console.log(`Issue #${issue_number} closed.`);

View File

@@ -4,14 +4,16 @@ on:
branches:
- master
permissions: {}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@master
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: codespell-project/actions-codespell@8f01853be192eb0f849a5c7d721450e7a467c579 # v2.2
with:
only_warn: 1
ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison"
ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison,iif,tread"
skip: "./.git,./build,./tools,*.js,*.lock,./test,./licenses,./redis/lolwut.cc,*.svg"

View File

@@ -0,0 +1,38 @@
name: Compare Build Systems
on:
pull_request:
branches:
- master
paths:
- 'configure.py'
- '**/CMakeLists.txt'
- 'cmake/**'
- 'scripts/compare_build_systems.py'
workflow_dispatch:
permissions:
contents: read
# cancel the in-progress run upon a repush
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
compare:
name: Compare configure.py vs CMake
needs:
- read-toolchain
runs-on: ubuntu-latest
container: ${{ needs.read-toolchain.outputs.image }}
steps:
- uses: actions/checkout@v4
with:
submodules: true
- name: Compare build systems
run: |
git config --global --add safe.directory $GITHUB_WORKSPACE
python3 scripts/compare_build_systems.py --ci

View File

@@ -12,13 +12,16 @@ on:
schedule:
- cron: '0 10 * * 1' # Runs every Monday at 10:00am
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
notify_conflict_prs:
runs-on: ubuntu-latest
steps:
- name: Notify PR Authors of Conflicts
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
console.log("Starting conflict reminder script...");

View File

@@ -13,6 +13,9 @@ on:
permissions:
contents: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
lint:
runs-on: ubuntu-latest
@@ -21,12 +24,12 @@ jobs:
security-events: write
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- name: Differential ShellCheck
uses: redhat-plumbers-in-action/differential-shellcheck@v5
uses: redhat-plumbers-in-action/differential-shellcheck@d965e66ec0b3b2f821f75c8eff9b12442d9a7d1e # v5.5.6
with:
severity: warning
token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -5,6 +5,7 @@ name: "Docs / Publish"
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}
DEFAULT_BRANCH: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'master' }}
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
push:
@@ -18,18 +19,24 @@ on:
jobs:
release:
permissions:
pages: write
id-token: write
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ env.DEFAULT_BRANCH }}
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -2,8 +2,12 @@ name: "Docs / Build PR"
# For more information,
# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows
permissions:
contents: read
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
pull_request:
@@ -19,14 +23,16 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -1,5 +1,11 @@
name: Docs / Validate metrics
permissions:
contents: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
pull_request:
branches:
@@ -7,7 +13,7 @@ on:
- enterprise
paths:
- '**/*.cc'
- 'scripts/metrics-config.yml'
- 'scripts/metrics-config.yml'
- 'scripts/get_description.py'
- 'docs/_ext/scylladb_metrics.py'
@@ -15,20 +21,20 @@ jobs:
validate-metrics:
runs-on: ubuntu-latest
name: Check metrics documentation coverage
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- name: Set up Python
uses: actions/setup-python@v6
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.10'
- name: Install dependencies
run: pip install PyYAML
- name: Validate metrics
run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

View File

@@ -13,8 +13,10 @@ env:
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions: {}
permissions:
contents: read
# cancel the in-progress run upon a repush
concurrency:
@@ -31,11 +33,9 @@ jobs:
runs-on: ubuntu-latest
container: ${{ needs.read-toolchain.outputs.image }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- run: |
sudo dnf -y install clang-tools-extra
- name: Generate compilation database
run: |
cmake \
@@ -90,7 +90,7 @@ jobs:
| tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"
- run: |
echo "::remove-matcher owner=seastar-bad-include::"
- uses: actions/upload-artifact@v4
- uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: Logs
path: |

View File

@@ -7,6 +7,7 @@ on:
env:
DEFAULT_BRANCH: 'master'
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
mark-ready:
@@ -17,7 +18,7 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}

View File

@@ -5,6 +5,8 @@ on:
branches:
- master
- next
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
label:
if: github.event.pull_request.draft == false
@@ -15,7 +17,7 @@ jobs:
steps:
- name: Wait for label to be added
run: sleep 1m
- uses: mheap/github-action-required-labels@v5
- uses: mheap/github-action-required-labels@0ac283b4e65c1fb28ce6079dea5546ceca98ccbe # v5.5.2
with:
mode: minimum
count: 1

View File

@@ -7,13 +7,18 @@ on:
description: "the toolchain docker image"
value: ${{ jobs.read-toolchain.outputs.image }}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
runs-on: ubuntu-latest
permissions:
contents: read
outputs:
image: ${{ steps.read.outputs.image }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: tools/toolchain/image
sparse-checkout-cone-mode: false

View File

@@ -13,6 +13,7 @@ concurrency:
env:
BUILD_DIR: build
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
@@ -29,12 +30,12 @@ jobs:
- RelWithDebInfo
- Dev
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- run: |
rm -rf seastar
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: scylladb/seastar
submodules: true

View File

@@ -7,6 +7,9 @@ on:
issues:
types: [labeled, unlabeled]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
label-sync:
if: ${{ github.repository == 'scylladb/scylladb' }}
@@ -21,7 +24,7 @@ jobs:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: |
.github/scripts/sync_labels.py

View File

@@ -1,4 +1,6 @@
name: Trigger Scylla CI Route
permissions:
contents: read
on:
issue_comment:
@@ -9,16 +11,56 @@ on:
jobs:
trigger-jenkins:
if: (github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')) || github.event.label.name == 'conflicts'
if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then
echo "member=true" >> $GITHUB_OUTPUT
else
echo "::warning::${AUTHOR} is not a member of scylladb (association: ${ASSOCIATION}); skipping CI trigger."
echo "member=false" >> $GITHUB_OUTPUT
fi
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT
else
echo "trigger=false" >> $GITHUB_OUTPUT
fi
- name: Trigger Scylla-CI-Route Jenkins Job
if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"
PR_REPO_NAME: "${{ github.event.repository.full_name }}"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail

View File

@@ -5,7 +5,10 @@ on:
types: [opened, reopened, synchronize]
issue_comment:
types: [created]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
trigger-ci:
runs-on: ubuntu-latest
@@ -15,7 +18,7 @@ jobs:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Checkout PR code
uses: actions/checkout@v3
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0 # Needed to access full history
ref: ${{ github.event.pull_request.head.ref }}

View File

@@ -1,5 +1,8 @@
name: Trigger next gating
permissions:
contents: read
on:
push:
branches:

View File

@@ -4,13 +4,16 @@ on:
schedule:
- cron: '10 8 * * *' # Runs daily at 8 AM
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
reminder:
runs-on: ubuntu-latest
steps:
- name: Send reminders
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const labelFilters = ['P0', 'P1', 'Field-Tier1','status/release blocker', 'status/regression'];

View File

@@ -2,6 +2,12 @@ cmake_minimum_required(VERSION 3.27)
project(scylla)
# Disable CMake's automatic -fcolor-diagnostics injection (CMake 3.24+ adds
# it for Clang+Ninja). configure.py does not add any color diagnostics flags,
# so we clear the internal CMake variable to prevent injection.
set(CMAKE_CXX_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
set(CMAKE_C_COMPILE_OPTIONS_COLOR_DIAGNOSTICS "")
list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${CMAKE_CURRENT_SOURCE_DIR}/seastar/cmake)
@@ -51,6 +57,16 @@ set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")
set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
# Global defines matching configure.py
# Since gcc 13, libgcc doesn't need the exception workaround
add_compile_definitions(SEASTAR_NO_EXCEPTION_HACK)
# Hacks needed to expose internal APIs for xxhash dependencies
add_compile_definitions(XXH_PRIVATE_API)
# SEASTAR_TESTING_MAIN is added later (after add_subdirectory(seastar) and
# add_subdirectory(abseil)) to avoid leaking into the seastar subdirectory.
# If SEASTAR_TESTING_MAIN is defined globally before seastar, it causes a
# duplicate 'main' symbol in seastar_testing.
if(is_multi_config)
find_package(Seastar)
# this is atypical compared to standard ExternalProject usage:
@@ -96,12 +112,33 @@ else()
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_IO_URING ON CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 24 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
# Match configure.py's build_seastar_shared_libs: Debug and Dev
# build Seastar as a shared library, others build it static.
if(CMAKE_BUILD_TYPE STREQUAL "Debug" OR CMAKE_BUILD_TYPE STREQUAL "Dev")
set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)
else()
set(BUILD_SHARED_LIBS OFF CACHE BOOL "" FORCE)
endif()
add_subdirectory(seastar)
target_compile_definitions (seastar
PRIVATE
SEASTAR_NO_EXCEPTION_HACK)
# Coverage mode sets cmake_build_type='Debug' for Seastar
# (configure.py:515), so Seastar's pkg-config output includes sanitizer
# link flags in seastar_libs_coverage (configure.py:2514,2649).
# Seastar's own CMake only activates sanitizer targets for Debug/Sanitize
# configs, so we inject link options on the seastar target for Coverage.
# Using PUBLIC ensures they propagate to all targets linking Seastar
# (but not standalone tools like patchelf), matching configure.py's
# behavior. Compile-time flags and defines are handled globally in
# cmake/mode.Coverage.cmake.
if(CMAKE_BUILD_TYPE STREQUAL "Coverage")
target_link_options(seastar
PUBLIC
-fsanitize=address
-fsanitize=undefined
-fsanitize=vptr)
endif()
endif()
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
@@ -111,8 +148,10 @@ if(Scylla_ENABLE_LTO)
endif()
find_package(Sanitizers QUIET)
# Match configure.py:2192 — abseil gets sanitizer flags with -fno-sanitize=vptr
# to exclude vptr checks which are incompatible with abseil's usage.
list(APPEND absl_cxx_flags
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>;-fno-sanitize=vptr>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
@@ -137,9 +176,38 @@ add_library(absl::headers ALIAS absl-headers)
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# Now that seastar and abseil subdirectories are fully processed, add
# SEASTAR_TESTING_MAIN globally. This matches configure.py's global define
# without leaking into seastar (which would cause duplicate main symbols).
add_compile_definitions(SEASTAR_TESTING_MAIN)
# System libraries dependencies
find_package(Boost REQUIRED
COMPONENTS filesystem program_options system thread regex unit_test_framework)
# When using shared Boost libraries, define BOOST_ALL_DYN_LINK (matching configure.py)
if(NOT Boost_USE_STATIC_LIBS)
add_compile_definitions(BOOST_ALL_DYN_LINK)
endif()
# CMake's Boost package config adds per-component defines like
# BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK, BOOST_REGEX_DYN_LINK, etc. on the
# imported targets. configure.py only uses BOOST_ALL_DYN_LINK (which covers
# all components), so strip the per-component defines to align the two build
# systems.
foreach(_boost_target
Boost::unit_test_framework
Boost::regex
Boost::filesystem
Boost::program_options
Boost::system
Boost::thread)
if(TARGET ${_boost_target})
# Completely remove all INTERFACE_COMPILE_DEFINITIONS from the Boost target.
# This prevents per-component *_DYN_LINK and *_NO_LIB defines from
# propagating. BOOST_ALL_DYN_LINK (set globally) covers all components.
set_property(TARGET ${_boost_target} PROPERTY INTERFACE_COMPILE_DEFINITIONS)
endif()
endforeach()
target_link_libraries(Boost::regex
INTERFACE
ICU::i18n
@@ -196,6 +264,10 @@ if (Scylla_USE_PRECOMPILED_HEADER)
message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")
target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")
target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)
# Match configure.py: -fpch-validate-input-files-content tells the compiler
# to check content of stdafx.hh if timestamps don't match (important for
# ccache/git workflows where timestamps may not be preserved).
add_compile_options(-fpch-validate-input-files-content)
endif()
else()
set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)
@@ -300,7 +372,6 @@ add_subdirectory(locator)
add_subdirectory(message)
add_subdirectory(mutation)
add_subdirectory(mutation_writer)
add_subdirectory(node_ops)
add_subdirectory(readers)
add_subdirectory(replica)
add_subdirectory(raft)

View File

@@ -1,8 +1,8 @@
## **SCYLLADB SOFTWARE LICENSE AGREEMENT**
| Version: | 1.0 |
| Version: | 1.1 |
| :---- | :---- |
| Last updated: | December 18, 2024 |
| Last updated: | April 12, 2026 |
**Your Acceptance**
@@ -12,20 +12,48 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Grant of License**
* **Software Definitions:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
* **Grant of License:** Subject to the terms and conditions of this Agreement, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
* **Definitions:**
1. **Software:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
2. **Commercial Customer**: means any legal entity (including its Affiliates) that has entered into a transaction with Licensor, or an authorized reseller/distributor, for the provision of any ScyllaDB products or services. This includes, without limitation: (a) Scope of Service: Any paid subscription, enterprise license, "BYOA" or Database-as-a-Service (DBaaS) offering, technical support, professional services, consulting, or training. (b) Scale and Volume: Any deployment regardless of size, capacity, or performance metrics (c) Payment Method: Any compensation model, including but not limited to, fixed-fee, consumption-based (On-Demand), committed spend, third-party marketplace credits (e.g., AWS, GCP, Azure), or promotional credits and discounts.
* **Grant of License:** Subject to the terms and conditions of this Agreement, including the Eligibility and Exclusive Use Restrictions clause, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
1) Copying, distributing, evaluating (including performing benchmarking or comparative tests or evaluations , subject to the limitations below) and improving the Software and ScyllaDB; and
2) create a modified version of the Software (each, a "**Licensed Work**"); provided however, that each such Licensed Work keeps all or substantially all of the functions and features of the Software, and/or using all or substantially all of the source code of the Software. You hereby agree that all the Licensed Work are, upon creation, considered Licensed Work of the Licensor, shall be the sole property of the Licensor and its assignees, and the Licensor and its assignees shall be the sole owner of all rights of any kind or nature, in connection with such Licensed Work. You hereby irrevocably and unconditionally assign to the Licensor all the Licensed Work and any part thereof. This License applies separately for each version of the Licensed Work, which shall be considered "Software" for the purpose of this Agreement.
* **Eligibility and Exclusive Use Restrictions**
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensees Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
i. Restricted to "Never Customers" Only. The license granted under this Agreement is strictly limited to Never Customers. For purposes of this Agreement, a "Never Customer" is an entity (including its Affiliates) that does not have, and has not had within the previous twelve (12) months, a paid commercial subscription, professional services agreement, or any other commercial relationship with Licensor. Satisfaction of the Never Customer criteria is a strict condition precedent to the effectiveness of this License.
ii. Total Prohibition for Existing Commercial Customers. If You (or any of Your Affiliates) are an existing Commercial Customer of Licensor within the last twelve (12) months, no license is deemed to have been offered or extended to You, and any download or installation of the Software by You is unauthorized. This prohibition applies to all deployments, including but not limited to:
(a) existing commercial workloads;
(b) any new use cases, new applications, or new workloads
iii. **No "Hybrid" Usage**. Licensee is expressly prohibited from combining free tier usage under this Agreement with paid commercial units.
If You are a Commercial Customer, all use of the Software across Your entire organization (and any of your Affiliates) must be governed by a valid, paid commercial agreement. Use of the Software under this license by a Commercial Customer (which is not a "Never Customer") shall:
(a) Void this license *ab initio*;
(b) Be deemed a material breach of both this Agreement and any existing commercial terms; and
(c) Entitle Licensor to invoice Licensee for such unauthorized usage at Licensor's standard list prices, retroactive to the date of first use.
Notwithstanding anything to the contrary in the Eligibility or License Limitations sections above a Commercial Customer may use the Software exclusively for non-production purposes, including Continuous Integration (CI), automated testing, and quality assurance environments, provided that such use at all times remains compliant with the Usage Limit.
iv. **Verification**. Licensor reserves the right to audit Licensee's environment and corporate identity to ensure compliance with these eligibility criteria.
For the purposes of this Agreement an "**Affiliate**" means any entity that directly or indirectly controls, is controlled by, or is under common control with a party, where "control" means ownership of more than 50% of the voting stock or decision-making authority
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensees Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) unless Licensee is a Never Customer, purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
* **Updates:** You shall be solely responsible for providing all equipment, systems, assets, access, and ancillary goods and services needed to access and Use the Software. Licensor may modify or update the Software at any time, without notification, in its sole and absolute discretion. After the effective date of each such update, Licensor shall bear no obligation to run, provide or support legacy versions of the Software.
* **"Usage Limit":** Licensee's total overall available storage across all deployments and clusters of the Software and the Licensed Work under this License shall not exceed 10TB and/or an upper limit of 50 VCPUs (hyper threads).
* **IP Markings:** Licensee must retain all copyright, trademark, and other proprietary notices contained in the Software. You will not modify, delete, alter, remove, or obscure any intellectual property, including without limitations licensing, copyright, trademark, or any other notices of Licensor in the Software.
* **License Reproduction:** You must conspicuously display this Agreement on each copy of the Software. If You receive the Software from a third party, this Agreement still applies to Your Use of the Software. You will be responsible for any breach of this Agreement by any such third-party.
* Distribution of any Licensed Works is permitted, provided that: (i) You must include in any Licensed Work prominent notices stating that You have modified the Software, (ii) You include a copy of this Agreement with the Licensed Work, and (iii) You clearly identify all modifications made in the Licensed Work and provides attribution to the Licensor as the original author(s) of the Software.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its Affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* Notwithstanding anything to the contrary, under the License granted hereunder, You shall not and shall not permit others to: (i) transfer the Software or any portions thereof to any other party except as expressly permitted herein; (ii) attempt to circumvent or overcome any technological protection measures incorporated into the Software; (iii) incorporate the Software into the structure, machinery or controls of any aircraft, other aerial device, military vehicle, hovercraft, waterborne craft or any medical equipment of any kind; or (iv) use the Software or any part thereof in any unlawful, harmful or illegal manner, or in a manner which infringes third parties rights in any way, including intellectual property rights.
**Monitoring; Audit**
@@ -41,14 +69,14 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Indemnity; Disclaimer; Limitation of Liability**
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensees breach of this Agreement; (ii) Licensees negligence, willful misconduct or violation of law, or (iii) Licensees products or services.
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its Affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensees breach of this Agreement; (ii) Licensees negligence, willful misconduct or violation of law, or (iii) Licensees products or services.
* DISCLAIMER OF WARRANTIES: LICENSEE AGREES THAT LICENSOR HAS MADE NO EXPRESS WARRANTIES REGARDING THE SOFTWARE AND THAT THE SOFTWARE IS BEING PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. LICENSOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THE SOFTWARE, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE; TITLE; MERCHANTABILITY; OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS. LICENSOR DOES NOT WARRANT THAT THE SOFTWARE WILL OPERATE UNINTERRUPTED OR ERROR FREE, OR THAT ALL ERRORS WILL BE CORRECTED. LICENSOR DOES NOT GUARANTEE ANY PARTICULAR RESULTS FROM THE USE OF THE SOFTWARE, AND DOES NOT WARRANT THAT THE SOFTWARE IS FIT FOR ANY PARTICULAR PURPOSE.
* LIMITATION OF LIABILITY: TO THE FULLEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW, IN NO EVENT WILL LICENSOR AND/OR ITS AFFILIATES, EMPLOYEES, OFFICERS AND DIRECTORS BE LIABLE TO LICENSEE FOR (I) ANY LOSS OF USE OR DATA; INTERRUPTION OF BUSINESS; OR ANY INDIRECT; SPECIAL; INCIDENTAL; OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING LOST PROFITS); AND (II) ANY DIRECT DAMAGES EXCEEDING THE TOTAL AMOUNT OF ONE THOUSAND US DOLLARS ($1,000). THE FOREGOING PROVISIONS LIMITING THE LIABILITY OF LICENSOR SHALL APPLY REGARDLESS OF THE FORM OR CAUSE OF ACTION, WHETHER IN STRICT LIABILITY, CONTRACT OR TORT.
**Proprietary Rights; No Other Rights**
* **Ownership:** Licensor retains sole and exclusive ownership of all rights, interests and title in the Software and any scripts, processes, techniques, methodologies, inventions, know-how, concepts, formatting, arrangements, visual attributes, ideas, database rights, copyrights, patents, trade secrets, and other intellectual property related thereto, and all derivatives, enhancements, modifications and improvements thereof. Except for the limited license rights granted herein, Licensee has no rights in or to the Software and/ or Licensors trademarks, logo, or branding and You acknowledge that such Software, trademarks, logo, or branding is the sole property of Licensor.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its Affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.
* **Third-Party Software:** Customer acknowledges that the Software may contain open and closed source components (“OSS Components”) that are governed separately by certain licenses, in each case as further provided by Company upon request. Any applicable OSS Component license is solely between Licensee and the applicable licensor of the OSS Component and Licensee shall comply with the applicable OSS Component license.
* If any provision of this Agreement is held to be invalid or unenforceable, such provision shall be struck and the remaining provisions shall remain in full force and effect.
@@ -56,7 +84,7 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Miscellaneous**
* **Miscellaneous:** This Agreement may be modified at any time by Licensor, and constitutes the entire agreement between the parties with respect to the subject matter hereof. Licensee may not assign or subcontract its rights or obligations under this Agreement. This Agreement does not, and shall not be construed to create any relationship, partnership, joint venture, employer-employee, agency, or franchisor-franchisee relationship between the parties.
* **Modifications**: Licensor reserves the right to modify this Agreement at any time. Changes will be effective upon posting to the Website or within the Software repository. Continued use of the Software after such changes constitutes acceptance.
* **Governing Law & Jurisdiction:** This Agreement shall be governed and construed in accordance with the laws of Israel, without giving effect to their respective conflicts of laws provisions, and the competent courts situated in Tel Aviv, Israel, shall have sole and exclusive jurisdiction over the parties and any conflict and/or dispute arising out of, or in connection to, this Agreement
\[*End of ScyllaDB Software License Agreement*\]

View File

@@ -43,7 +43,7 @@ For further information, please see:
[developer documentation]: HACKING.md
[build documentation]: docs/dev/building.md
[docker image build documentation]: dist/docker/debian/README.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.1.0-dev
VERSION=2026.2.0-dev
if test -f version
then

2
abseil

Submodule abseil updated: d7aaad83b4...255c84dadd

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "absl-flat_hash_map.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -9,6 +9,8 @@ target_sources(alternator
controller.cc
server.cc
executor.cc
executor_read.cc
executor_util.cc
stats.cc
serialization.cc
expressions.cc
@@ -18,6 +20,7 @@ target_sources(alternator
consumed_capacity.cc
ttl.cc
parsed_expression_cache.cc
http_compression.cc
${cql_grammar_srcs})
target_include_directories(alternator
PUBLIC

View File

@@ -0,0 +1,253 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>
#include "utils/rjson.hh"
#include "utils/overloaded_functor.hh"
#include "alternator/error.hh"
#include "alternator/expressions_types.hh"
namespace alternator {
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, and then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectExpression: for filtering from a full top-level attribute
// only the parts for which user asked in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions in the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// That only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
// attrs_to_get lists which top-level attribute are needed, and possibly also
// which part of the top-level attribute is really needed (when nested
// attribute paths appeared in the query).
// Most code actually uses optional<attrs_to_get>. There, a disengaged
// optional means we should get all attributes, not specific ones.
using attrs_to_get = attribute_path_map<std::monostate>;
// takes a given JSON value and drops its parts which weren't asked to be
// kept. It modifies the given JSON value, or returns false to signify that
// the entire object should be dropped.
// Note that The JSON value is assumed to be encoded using the DynamoDB
// conventions - i.e., it is really a map whose key has a type string,
// and the value is the real object.
template<typename T>
bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>& h) {
if (!val.IsObject() || val.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::internal(format("Malformed value object read: {}", val));
}
const char* type = val.MemberBegin()->name.GetString();
rjson::value& v = val.MemberBegin()->value;
if (h.has_members()) {
const auto& members = h.get_members();
if (type[0] != 'M' || !v.IsObject()) {
// If v is not an object (dictionary, map), none of the members
// can match.
return false;
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = rjson::to_string(it->name);
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
// Only a part of this attribute is to be filtered, do it.
if (hierarchy_filter(it->value, *x->second)) {
// because newv started empty and attr are unique
// (keys of v), we can use add() here
rjson::add_with_string_name(newv, attr, std::move(it->value));
}
} else {
// The entire attribute is to be kept
rjson::add_with_string_name(newv, attr, std::move(it->value));
}
}
}
if (newv.MemberCount() == 0) {
return false;
}
v = newv;
} else if (h.has_indexes()) {
const auto& indexes = h.get_indexes();
if (type[0] != 'L' || !v.IsArray()) {
return false;
}
rjson::value newv = rjson::empty_array();
const auto& a = v.GetArray();
for (unsigned i = 0; i < v.Size(); i++) {
auto x = indexes.find(i);
if (x != indexes.end()) {
if (x->second) {
if (hierarchy_filter(a[i], *x->second)) {
rjson::push_back(newv, std::move(a[i]));
}
} else {
// The entire attribute is to be kept
rjson::push_back(newv, std::move(a[i]));
}
}
}
if (newv.Size() == 0) {
return false;
}
v = newv;
}
return true;
}
// Add a path to an attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the filter (one is a sub-path of the other)
// or "conflicts" with it (both a member and index is requested).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const parsed::path& p, T value = {}) {
using node = attribute_path_map_node<T>;
// The first step is to look for the top-level attribute (p.root()):
auto it = map.find(p.root());
if (it == map.end()) {
if (p.has_operators()) {
it = map.emplace(p.root(), node {std::nullopt}).first;
} else {
(void) map.emplace(p.root(), node {std::move(value)}).first;
// Value inserted for top-level node. We're done.
return;
}
} else if(!p.has_operators()) {
// If p is top-level and we already have it or a part of it
// in map, it's a forbidden overlapping path.
throw api_error::validation(fmt::format(
"Invalid {}: two document paths overlap at {}", source, p.root()));
} else if (it->second.has_value()) {
// If we're here, it != map.end() && p.has_operators && it->second.has_value().
// This means the top-level attribute already has a value, and we're
// trying to add a non-top-level value. It's an overlap.
throw api_error::validation(fmt::format("Invalid {}: two document paths overlap at {}", source, p.root()));
}
node* h = &it->second;
// The second step is to walk h from the top-level node to the inner node
// where we're supposed to insert the value:
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (h->is_empty()) {
*h = node {typename node::members_t()};
} else if (h->has_indexes()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::members_t& members = h->get_members();
auto it = members.find(member);
if (it == members.end()) {
it = members.insert({member, std::make_unique<node>()}).first;
}
h = it->second.get();
},
[&] (unsigned index) {
if (h->is_empty()) {
*h = node {typename node::indexes_t()};
} else if (h->has_members()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::indexes_t& indexes = h->get_indexes();
auto it = indexes.find(index);
if (it == indexes.end()) {
it = indexes.insert({index, std::make_unique<node>()}).first;
}
h = it->second.get();
}
}, op);
}
// Finally, insert the value in the node h.
if (h->is_empty()) {
*h = node {std::move(value)};
} else {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
}
// A very simplified version of the above function for the special case of
// adding only top-level attribute. It's not only simpler, we also use a
// different error message, referring to a "duplicate attribute" instead of
// "overlapping paths". DynamoDB also has this distinction (errors in
// AttributesToGet refer to duplicates, not overlaps, but errors in
// ProjectionExpression refer to overlap - even if it's an exact duplicate).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const std::string& attr, T value = {}) {
using node = attribute_path_map_node<T>;
auto it = map.find(attr);
if (it == map.end()) {
map.emplace(attr, node {std::move(value)});
} else {
throw api_error::validation(fmt::format(
"Invalid {}: Duplicate attribute: {}", source, attr));
}
}
} // namespace alternator

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "alternator/error.hh"
@@ -13,7 +13,8 @@
#include <string_view>
#include "alternator/auth.hh"
#include <fmt/format.h>
#include "auth/password_authenticator.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "service/storage_proxy.hh"
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
@@ -25,8 +26,8 @@ namespace alternator {
static logging::logger alogger("alternator-auth");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(auth::get_auth_ks_name(as.query_processor()), "roles");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(db::system_keyspace::NAME, "roles");
partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
@@ -39,7 +40,7 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
auto cl = db::consistency_level::LOCAL_ONE;
service::client_state client_state{service::client_state::internal_tag()};
service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -20,6 +20,6 @@ namespace alternator {
using key_cache = utils::loading_cache<std::string, std::string, 1>;
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username);
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include <string_view>
@@ -618,7 +618,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function can throw an ValidationException API error if there
// This function can throw a ValidationException API error if there
// are errors in the format of the condition itself.
bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
/*

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "consumed_capacity.hh"
@@ -45,7 +45,7 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
}
void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {
if (_should_add_to_reponse) {
if (_should_add_to_response) {
auto consumption = rjson::empty_object();
rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());
rjson::add(response, "ConsumedCapacity", std::move(consumption));

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -28,9 +28,9 @@ namespace alternator {
class consumed_capacity_counter {
public:
consumed_capacity_counter() = default;
consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}
consumed_capacity_counter(bool should_add_to_response) : _should_add_to_response(should_add_to_response){}
bool operator()() const noexcept {
return _should_add_to_reponse;
return _should_add_to_response;
}
consumed_capacity_counter& operator +=(uint64_t bytes);
@@ -44,7 +44,7 @@ public:
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
bool _should_add_to_response = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include <seastar/core/with_scheduling_group.hh>
@@ -18,6 +18,7 @@
#include "service/memory_limiter.hh"
#include "auth/service.hh"
#include "service/qos/service_level_controller.hh"
#include "vector_search/vector_store_client.hh"
using namespace seastar;
@@ -28,23 +29,29 @@ static logging::logger logger("alternator_controller");
controller::controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
const db::config& config,
seastar::scheduling_group sg)
: protocol_server(sg)
, _gossiper(gossiper)
, _proxy(proxy)
, _ss(ss)
, _mm(mm)
, _sys_dist_ks(sys_dist_ks)
, _sys_ks(sys_ks)
, _cdc_gen_svc(cdc_gen_svc)
, _memory_limiter(memory_limiter)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _vsc(vsc)
, _config(config)
{
}
@@ -89,8 +96,8 @@ future<> controller::start_server() {
auto get_timeout_in_ms = [] (const db::config& cfg) -> utils::updateable_value<uint32_t> {
return cfg.alternator_timeout_in_ms;
};
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value(),
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks), std::ref(_sys_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), std::ref(_vsc), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
// Note: from this point on, if start_server() throws for any reason,
@@ -103,11 +110,23 @@ future<> controller::start_server() {
alternator_port = _config.alternator_port();
_listen_addresses.push_back({addr, *alternator_port});
}
std::optional<uint16_t> alternator_port_proxy_protocol;
if (_config.alternator_port_proxy_protocol()) {
alternator_port_proxy_protocol = _config.alternator_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_port_proxy_protocol});
}
std::optional<uint16_t> alternator_https_port;
std::optional<uint16_t> alternator_https_port_proxy_protocol;
std::optional<tls::credentials_builder> creds;
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
if (_config.alternator_https_port() || _config.alternator_https_port_proxy_protocol()) {
if (_config.alternator_https_port()) {
alternator_https_port = _config.alternator_https_port();
_listen_addresses.push_back({addr, *alternator_https_port});
}
if (_config.alternator_https_port_proxy_protocol()) {
alternator_https_port_proxy_protocol = _config.alternator_https_port_proxy_protocol();
_listen_addresses.push_back({addr, *alternator_https_port_proxy_protocol});
}
creds.emplace();
auto opts = _config.alternator_encryption_options();
if (opts.empty()) {
@@ -133,20 +152,29 @@ future<> controller::start_server() {
}
}
_server.invoke_on_all(
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds,
[this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds,
_config.alternator_enforce_authorization,
_config.alternator_warn_authorization,
_config.alternator_max_users_query_size_in_trace_output,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);
}).handle_exception([this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] (std::exception_ptr ep) {
logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}, proxy-protocol port {}, TLS proxy-protocol port {}: {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF",
ep);
return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });
}).then([addr, alternator_port, alternator_https_port] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",
addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");
}).then([addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] {
logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}, proxy-protocol port {}, TLS proxy-protocol port {}",
addr,
alternator_port ? std::to_string(*alternator_port) : "OFF",
alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",
alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",
alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF");
}).get();
});
}
@@ -169,7 +197,7 @@ future<> controller::request_stop_server() {
});
}
future<utils::chunked_vector<client_data>> controller::get_client_data() {
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> controller::get_client_data() {
return _server.local().get_client_data();
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -15,12 +15,14 @@
namespace service {
class storage_proxy;
class storage_service;
class migration_manager;
class memory_limiter;
}
namespace db {
class system_distributed_keyspace;
class system_keyspace;
class config;
}
@@ -42,6 +44,10 @@ namespace qos {
class service_level_controller;
}
namespace vector_search {
class vector_store_client;
}
namespace alternator {
// This is the official DynamoDB API version.
@@ -57,12 +63,15 @@ class server;
class controller : public protocol_server {
sharded<gms::gossiper>& _gossiper;
sharded<service::storage_proxy>& _proxy;
sharded<service::storage_service>& _ss;
sharded<service::migration_manager>& _mm;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<db::system_keyspace>& _sys_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
sharded<service::memory_limiter>& _memory_limiter;
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
sharded<vector_search::vector_store_client>& _vsc;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -74,12 +83,15 @@ public:
controller(
sharded<gms::gossiper>& gossiper,
sharded<service::storage_proxy>& proxy,
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
const db::config& config,
seastar::scheduling_group sg);
@@ -93,7 +105,7 @@ public:
// This virtual function is called (on each shard separately) when the
// virtual table "system.clients" is read. It is expected to generate a
// list of clients connected to this server (on this shard).
virtual future<utils::chunked_vector<client_data>> get_client_data() override;
virtual future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data() override;
};
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

File diff suppressed because it is too large Load Diff

View File

@@ -3,13 +3,15 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <seastar/core/future.hh>
#include "audit/audit.hh"
#include "seastarx.hh"
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
@@ -17,16 +19,26 @@
#include "service/client_state.hh"
#include "service_permit.hh"
#include "db/timeout_clock.hh"
#include "db/config.hh"
#include "alternator/error.hh"
#include "stats.hh"
#include "alternator/attribute_path.hh"
#include "alternator/stats.hh"
#include "alternator/executor_util.hh"
#include "utils/rjson.hh"
#include "utils/updateable_value.hh"
#include "tracing/trace_state.hh"
namespace db {
class system_distributed_keyspace;
class system_keyspace;
}
namespace audit {
class audit_info_alternator;
}
namespace query {
@@ -41,6 +53,11 @@ namespace cql3::selection {
namespace service {
class storage_proxy;
class cas_shard;
class storage_service;
}
namespace vector_search {
class vector_store_client;
}
namespace cdc {
@@ -55,99 +72,40 @@ class gossiper;
class schema_builder;
namespace alternator {
enum class table_status;
class rmw_operation;
class put_or_delete_item;
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
// Wraps the db::get_tags_of_table and throws if the table is missing the tags extension.
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, and then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectExpression: for filtering from a full top-level attribute
// only the parts for which user asked in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions in the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// That only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
// attrs_to_get lists which top-level attribute are needed, and possibly also
// which part of the top-level attribute is really needed (when nested
// attribute paths appeared in the query).
// Most code actually uses optional<attrs_to_get>. There, a disengaged
// optional means we should get all attributes, not specific ones.
using attrs_to_get = attribute_path_map<std::monostate>;
namespace parsed {
class expression_cache;
}
class executor : public peering_sharded_service<executor> {
gms::gossiper& _gossiper;
service::storage_service& _ss;
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
db::system_keyspace& _system_keyspace;
cdc::metadata& _cdc_metadata;
vector_search::vector_store_client& _vsc;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
seastar::sharded<audit::audit>& _audit;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
std::unique_ptr<parsed::expression_cache> _parsed_expression_cache;
struct describe_table_info_manager;
std::unique_ptr<describe_table_info_manager> _describe_table_info_manager;
future<> cache_newly_calculated_size_on_all_shards(schema_ptr schema, std::uint64_t size_in_bytes, std::chrono::nanoseconds ttl);
future<> fill_table_size(rjson::value &table_description, schema_ptr schema, bool deleting);
public:
using client_state = service::client_state;
// request_return_type is the return type of the executor methods, which
@@ -161,7 +119,6 @@ public:
// is written in chunks to the output_stream. This allows for efficient
// handling of large responses without needing to allocate a large buffer
// in memory.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
using request_return_type = std::variant<std::string, body_writer, api_error>;
stats _stats;
// The metric_groups object holds this stat object's metrics registered
@@ -173,53 +130,63 @@ public:
executor(gms::gossiper& gossiper,
service::storage_proxy& proxy,
service::storage_service& ss,
service::migration_manager& mm,
db::system_distributed_keyspace& sdks,
db::system_keyspace& system_keyspace,
cdc::metadata& cdc_metadata,
vector_search::vector_store_client& vsc,
smp_service_group ssg,
utils::updateable_value<uint32_t> default_timeout_in_ms);
~executor();
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> list_tables(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header);
future<request_return_type> batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> batch_get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> query(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request);
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_tables(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> batch_get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> query(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<> start();
future<> stop();
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
private:
static thread_local utils::updateable_value<uint32_t> s_default_timeout_in_ms;
public:
static schema_ptr find_table(service::storage_proxy&, std::string_view table_name);
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
private:
friend class rmw_operation;
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);
// Helper to set up auditing for an Alternator operation. Checks whether
// the operation should be audited (via will_log()) and if so, allocates
// and populates audit_info. No allocation occurs when auditing is disabled.
void maybe_audit(std::unique_ptr<audit::audit_info_alternator>& audit_info,
audit::statement_category category,
std::string_view ks_name,
std::string_view table_name,
std::string_view operation_name,
const rjson::value& request,
std::optional<db::consistency_level> cl = std::nullopt);
future<rjson::value> fill_table_description(schema_ptr schema, table_status tbl_status, service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit);
future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization,
bool warn_authorization, const db::tablets_mode_t::mode tablets_mode, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<> do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
@@ -232,60 +199,34 @@ private:
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);
static std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
uint64_t* item_length_in_bytes = nullptr,
bool = false);
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};
// is_big() checks approximately if the given JSON value is "bigger" than
// the given big_size number of bytes. The goal is to *quickly* detect
// oversized JSON that, for example, is too large to be serialized to a
// contiguous string - we don't need an accurate size for that. Moreover,
// as soon as we detect that the JSON is indeed "big", we can return true
// and don't need to continue calculating its exact size.
// For simplicity, we use a recursive implementation. This is fine because
// Alternator limits the depth of JSONs it reads from inputs, and doesn't
// add more than a couple of levels in its own output construction.
bool is_big(const rjson::value& val, int big_size = 100'000);
// returns table creation time in seconds since epoch for `db_clock`
double get_table_creation_time(const schema &schema);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
executor::body_writer make_streamed(rjson::value&&);
// result of parsing ARN (Amazon Resource Name)
// ARN format is `arn:<partition>:<service>:<region>:<account-id>:<resource-type>/<resource-id>/<postfix>`
// we ignore partition, service and account-id
// resource-type must be string "table"
// resource-id will be returned as table_name
// region will be returned as keyspace_name
// postfix is a string after resource-id and will be returned as is (whole), including separator.
struct arn_parts {
std::string_view keyspace_name;
std::string_view table_name;
std::string_view postfix;
};
// arn - arn to parse
// arn_field_name - identifier of the ARN, used only when reporting an error (in error messages), for example "Incorrect resource identifier `<arn_field_name>`"
// type_name - used only when reporting an error (in error messages), for example "... is not a valid <type_name> ARN ..."
// expected_postfix - optional filter of postfix value (part of ARN after resource-id, including separator, see comments for struct arn_parts).
// If is empty - then postfix value must be empty as well
// if not empty - postfix value must start with expected_postfix, but might be longer
arn_parts parse_arn(std::string_view arn, std::string_view arn_field_name, std::string_view type_name, std::string_view expected_postfix);
// The format is ks1|ks2|ks3... and table1|table2|table3...
sstring print_names_for_audit(const std::set<sstring>& names);
}

1957
alternator/executor_read.cc Normal file

File diff suppressed because it is too large Load Diff

559
alternator/executor_util.cc Normal file
View File

@@ -0,0 +1,559 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "alternator/executor_util.hh"
#include "alternator/executor.hh"
#include "alternator/error.hh"
#include "auth/resource.hh"
#include "auth/service.hh"
#include "cdc/log.hh"
#include "data_dictionary/data_dictionary.hh"
#include "db/tags/utils.hh"
#include "replica/database.hh"
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "serialization.hh"
#include "service/storage_proxy.hh"
#include "types/map.hh"
#include <fmt/format.h>
namespace alternator {
extern logging::logger elogger; // from executor.cc
std::optional<int> get_int_attribute(const rjson::value& value, std::string_view attribute_name) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value)
return {};
if (!attribute_value->IsInt()) {
throw api_error::validation(fmt::format("Expected integer value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetInt();
}
std::string get_string_attribute(const rjson::value& value, std::string_view attribute_name, const char* default_return) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value)
return default_return;
if (!attribute_value->IsString()) {
throw api_error::validation(fmt::format("Expected string value for attribute {}, got: {}",
attribute_name, value));
}
return rjson::to_string(*attribute_value);
}
bool get_bool_attribute(const rjson::value& value, std::string_view attribute_name, bool default_return) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value) {
return default_return;
}
if (!attribute_value->IsBool()) {
throw api_error::validation(fmt::format("Expected boolean value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetBool();
}
std::optional<std::string> find_table_name(const rjson::value& request) {
const rjson::value* table_name_value = rjson::find(request, "TableName");
if (!table_name_value) {
return std::nullopt;
}
if (!table_name_value->IsString()) {
throw api_error::validation("Non-string TableName field in request");
}
std::string table_name = rjson::to_string(*table_name_value);
return table_name;
}
std::string get_table_name(const rjson::value& request) {
auto name = find_table_name(request);
if (!name) {
throw api_error::validation("Missing TableName field in request");
}
return *name;
}
schema_ptr find_table(service::storage_proxy& proxy, const rjson::value& request) {
auto table_name = find_table_name(request);
if (!table_name) {
return nullptr;
}
return find_table(proxy, *table_name);
}
schema_ptr find_table(service::storage_proxy& proxy, std::string_view table_name) {
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + sstring(table_name), table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Table: {} not found", table_name));
}
}
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request) {
auto schema = find_table(proxy, request);
if (!schema) {
// if we get here then the name was missing, since syntax or missing actual CF
// checks throw. Slow path, but just call get_table_name to generate exception.
get_table_name(request);
}
return schema;
}
map_type attrs_type() {
static thread_local auto t = map_type_impl::get_instance(utf8_type, bytes_type, true);
return t;
}
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema) {
auto tags_ptr = db::get_tags_of_table(schema);
if (tags_ptr) {
return *tags_ptr;
} else {
throw api_error::validation(format("Table {} does not have valid tagging information", schema->ks_name()));
}
}
bool is_alternator_keyspace(std::string_view ks_name) {
return ks_name.starts_with(executor::KEYSPACE_NAME_PREFIX);
}
// This tag is set on a GSI when the user did not specify a range key, causing
// Alternator to add the base table's range key as a spurious range key. It is
// used by describe_key_schema() to suppress reporting that key.
extern const sstring SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY;
void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string, std::string>* attribute_types, const std::map<sstring, sstring>* tags) {
rjson::value key_schema = rjson::empty_array();
const bool ignore_range_keys_as_spurious = tags != nullptr && tags->contains(SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY);
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(cdef.name_as_text()));
rjson::add(key, "KeyType", "HASH");
rjson::push_back(key_schema, std::move(key));
if (attribute_types) {
(*attribute_types)[cdef.name_as_text()] = type_to_string(cdef.type);
}
}
if (!ignore_range_keys_as_spurious) {
// NOTE: user requested key (there can be at most one) will always come first.
// There might be more keys following it, which were added, but those were
// not requested by the user, so we ignore them.
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(cdef.name_as_text()));
rjson::add(key, "KeyType", "RANGE");
rjson::push_back(key_schema, std::move(key));
if (attribute_types) {
(*attribute_types)[cdef.name_as_text()] = type_to_string(cdef.type);
}
break;
}
}
rjson::add(parent, "KeySchema", std::move(key_schema));
}
// Check if the given string has valid characters for a table name, i.e. only
// a-z, A-Z, 0-9, _ (underscore), - (dash), . (dot). Note that this function
// does not check the length of the name - instead, use validate_table_name()
// to validate both the characters and the length.
static bool valid_table_name_chars(std::string_view name) {
for (auto c : name) {
if ((c < 'a' || c > 'z') &&
(c < 'A' || c > 'Z') &&
(c < '0' || c > '9') &&
c != '_' &&
c != '-' &&
c != '.') {
return false;
}
}
return true;
}
std::string view_name(std::string_view table_name, std::string_view index_name, const std::string& delim, bool validate_len) {
if (index_name.length() < 3) {
throw api_error::validation("IndexName must be at least 3 characters long");
}
if (!valid_table_name_chars(index_name)) {
throw api_error::validation(
fmt::format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
}
std::string ret = std::string(table_name) + delim + std::string(index_name);
if (ret.length() > max_auxiliary_table_name_length && validate_len) {
throw api_error::validation(
fmt::format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
table_name, index_name, max_auxiliary_table_name_length - delim.size()));
}
return ret;
}
std::string gsi_name(std::string_view table_name, std::string_view index_name, bool validate_len) {
return view_name(table_name, index_name, ":", validate_len);
}
std::string lsi_name(std::string_view table_name, std::string_view index_name, bool validate_len) {
return view_name(table_name, index_name, "!:", validate_len);
}
void check_key(const rjson::value& key, const schema_ptr& schema) {
if (key.MemberCount() != (schema->clustering_key_size() == 0 ? 1 : 2)) {
throw api_error::validation("Given key attribute not in schema");
}
}
void verify_all_are_used(const rjson::value* field,
const std::unordered_set<std::string>& used, const char* field_name, const char* operation) {
if (!field) {
return;
}
for (auto it = field->MemberBegin(); it != field->MemberEnd(); ++it) {
if (!used.contains(rjson::to_string(it->name))) {
throw api_error::validation(
format("{} has spurious '{}', not used in {}",
field_name, rjson::to_string_view(it->name), operation));
}
}
}
// This function increments the authorization_failures counter, and may also
// log a warn-level message and/or throw an access_denied exception, depending
// on what enforce_authorization and warn_authorization are set to.
// Note that if enforce_authorization is false, this function will return
// without throwing. So a caller that doesn't want to continue after an
// authorization_error must explicitly return after calling this function.
static void authorization_error(stats& stats, bool enforce_authorization, bool warn_authorization, std::string msg) {
stats.authorization_failures++;
if (enforce_authorization) {
if (warn_authorization) {
elogger.warn("alternator_warn_authorization=true: {}", msg);
}
throw api_error::access_denied(std::move(msg));
} else {
if (warn_authorization) {
elogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {}", msg);
}
}
}
future<> verify_permission(
bool enforce_authorization,
bool warn_authorization,
const service::client_state& client_state,
const schema_ptr& schema,
auth::permission permission_to_check,
stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
// Unfortunately, the fix for issue #23218 did not modify the function
// that we use here - check_has_permissions(). So if we want to allow
// writes to internal tables (from try_get_internal_table()) only to a
// superuser, we need to explicitly check it here.
if (permission_to_check == auth::permission::MODIFY && is_internal_keyspace(schema->ks_name())) {
if (!client_state.user() ||
!client_state.user()->name ||
!co_await client_state.get_auth_service()->underlying_role_manager().is_superuser(*client_state.user()->name)) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"Write access denied on internal table {}.{} to role {} because it is not a superuser",
schema->ks_name(), schema->cf_name(), username));
co_return;
}
}
auto resource = auth::make_data_resource(schema->ks_name(), schema->cf_name());
if (!client_state.user() || !client_state.user()->name ||
!co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
// Using exceptions for errors makes this function faster in the
// success path (when the operation is allowed).
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"{} access on table {}.{} is denied to role {}, client address {}",
auth::permissions::to_string(permission_to_check),
schema->ks_name(), schema->cf_name(), username, client_state.get_client_address()));
}
}
// Similar to verify_permission() above, but just for CREATE operations.
// Those do not operate on any specific table, so require permissions on
// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state& client_state, stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
auto resource = auth::resource(auth::resource_kind::data);
if (!co_await client_state.check_has_permission(auth::command_desc(auth::permission::CREATE, resource))) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"CREATE access on ALL KEYSPACES is denied to role {}", username));
}
}
schema_ptr try_get_internal_table(const data_dictionary::database& db, std::string_view table_name) {
size_t it = table_name.find(executor::INTERNAL_TABLE_PREFIX);
if (it != 0) {
return schema_ptr{};
}
table_name.remove_prefix(executor::INTERNAL_TABLE_PREFIX.size());
size_t delim = table_name.find_first_of('.');
if (delim == std::string_view::npos) {
return schema_ptr{};
}
std::string_view ks_name = table_name.substr(0, delim);
table_name.remove_prefix(ks_name.size() + 1);
// Only internal keyspaces can be accessed to avoid leakage
auto ks = db.try_find_keyspace(ks_name);
if (!ks || !ks->is_internal()) {
return schema_ptr{};
}
try {
return db.find_schema(ks_name, table_name);
} catch (data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Internal table: {}.{} not found", ks_name, table_name));
}
}
schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request) {
sstring table_name = rjson::to_sstring(batch_request->name); // JSON keys are always strings
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + table_name, table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(format("Requested resource not found: Table: {} not found", table_name));
}
}
lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, const schema& schema) {
try {
replica::table& table = sp.local_db().find_column_family(schema.id());
if (!table.get_stats().alternator_stats) {
table.get_stats().alternator_stats = seastar::make_shared<table_stats>(schema.ks_name(), schema.cf_name());
}
return table.get_stats().alternator_stats->_stats;
} catch (std::runtime_error&) {
// If we're here it means that a table we are currently working on was deleted before the
// operation completed, returning a temporary object is fine, if the table get deleted so will its metrics
return make_lw_shared<stats>();
}
}
void describe_single_item(const cql3::selection::selection& selection,
const std::vector<managed_bytes_opt>& result_row,
const std::optional<attrs_to_get>& attrs_to_get,
rjson::value& item,
uint64_t* item_length_in_bytes,
bool include_all_embedded_attributes)
{
const auto& columns = selection.get_columns();
auto column_it = columns.begin();
for (const managed_bytes_opt& cell : result_row) {
if (!cell) {
++column_it;
continue;
}
std::string column_name = (*column_it)->name_as_text();
if (column_name != executor::ATTRS_COLUMN_NAME) {
if (item_length_in_bytes) {
(*item_length_in_bytes) += column_name.length() + cell->size();
}
if (!attrs_to_get || attrs_to_get->contains(column_name)) {
// item is expected to start empty, and column_name are unique
// so add() makes sense
rjson::add_with_string_name(item, column_name, rjson::empty_object());
rjson::value& field = item[column_name.c_str()];
cell->with_linearized([&] (bytes_view linearized_cell) {
rjson::add_with_string_name(field, type_to_string((*column_it)->type), json_key_column_value(linearized_cell, **column_it));
});
}
} else {
auto deserialized = attrs_type()->deserialize(*cell);
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
if (item_length_in_bytes) {
(*item_length_in_bytes) += attr_name.length();
}
if (include_all_embedded_attributes || !attrs_to_get || attrs_to_get->contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
if (item_length_in_bytes && value.length()) {
// ScyllaDB uses one extra byte compared to DynamoDB for the bytes length
(*item_length_in_bytes) += value.length() - 1;
}
rjson::value v = deserialize_item(value);
if (attrs_to_get) {
auto it = attrs_to_get->find(attr_name);
if (it != attrs_to_get->end()) {
// attrs_to_get may have asked for only part of
// this attribute. hierarchy_filter() modifies v,
// and returns false when nothing is to be kept.
if (!hierarchy_filter(v, it->second)) {
continue;
}
}
}
// item is expected to start empty, and attribute
// names are unique so add() makes sense
rjson::add_with_string_name(item, attr_name, std::move(v));
} else if (item_length_in_bytes) {
(*item_length_in_bytes) += value_cast<bytes>(entry.second).length() - 1;
}
}
}
++column_it;
}
}
std::optional<rjson::value> describe_single_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get,
uint64_t* item_length_in_bytes) {
rjson::value item = rjson::empty_object();
cql3::selection::result_set_builder builder(selection, gc_clock::now());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
auto result_set = builder.build();
if (result_set->empty()) {
if (item_length_in_bytes) {
// empty results is counted as having a minimal length (e.g. 1 byte).
(*item_length_in_bytes) += 1;
}
// If there is no matching item, we're supposed to return an empty
// object without an Item member - not one with an empty Item member
return {};
}
if (result_set->size() > 1) {
// If the result set contains multiple rows, the code should have
// called describe_multi_item(), not this function.
throw std::logic_error("describe_single_item() asked to describe multiple items");
}
describe_single_item(selection, *result_set->rows().begin(), attrs_to_get, item, item_length_in_bytes);
return item;
}
static void check_big_array(const rjson::value& val, int& size_left);
static void check_big_object(const rjson::value& val, int& size_left);
// For simplicity, we use a recursive implementation. This is fine because
// Alternator limits the depth of JSONs it reads from inputs, and doesn't
// add more than a couple of levels in its own output construction.
bool is_big(const rjson::value& val, int big_size) {
if (val.IsString()) {
return ssize_t(val.GetStringLength()) > big_size;
} else if (val.IsObject()) {
check_big_object(val, big_size);
return big_size < 0;
} else if (val.IsArray()) {
check_big_array(val, big_size);
return big_size < 0;
}
return false;
}
static void check_big_array(const rjson::value& val, int& size_left) {
// Assume a fixed size of 10 bytes for each number, boolean, etc., or
// beginning of a sub-object. This doesn't have to be accurate.
size_left -= 10 * val.Size();
for (const auto& v : val.GetArray()) {
if (size_left < 0) {
return;
}
// Note that we avoid recursive calls for the leaves (anything except
// array or object) because usually those greatly outnumber the trunk.
if (v.IsString()) {
size_left -= v.GetStringLength();
} else if (v.IsObject()) {
check_big_object(v, size_left);
} else if (v.IsArray()) {
check_big_array(v, size_left);
}
}
}
static void check_big_object(const rjson::value& val, int& size_left) {
size_left -= 10 * val.MemberCount();
for (const auto& m : val.GetObject()) {
if (size_left < 0) {
return;
}
size_left -= m.name.GetStringLength();
if (m.value.IsString()) {
size_left -= m.value.GetStringLength();
} else if (m.value.IsObject()) {
check_big_object(m.value, size_left);
} else if (m.value.IsArray()) {
check_big_array(m.value, size_left);
}
}
}
void validate_table_name(std::string_view name, const char* source) {
if (name.length() < 3 || name.length() > max_table_name_length) {
throw api_error::validation(
format("{} must be at least 3 characters long and at most {} characters long", source, max_table_name_length));
}
if (!valid_table_name_chars(name)) {
throw api_error::validation(
format("{} must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", source));
}
}
void validate_cdc_log_name_length(std::string_view table_name) {
if (cdc::log_name(table_name).length() > max_auxiliary_table_name_length) {
// CDC will add cdc_log_suffix ("_scylla_cdc_log") to the table name
// to create its log table, and this will exceed the maximum allowed
// length. To provide a more helpful error message, we assume that
// cdc::log_name() always adds a suffix of the same length.
int suffix_len = cdc::log_name(table_name).length() - table_name.length();
throw api_error::validation(fmt::format("Streams or vector search cannot be enabled on a table whose name is longer than {} characters: {}",
max_auxiliary_table_name_length - suffix_len, table_name));
}
}
body_writer make_streamed(rjson::value&& value) {
return [value = std::move(value)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print(value, out);
} catch (...) {
ex = std::current_exception();
}
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
}
} // namespace alternator

247
alternator/executor_util.hh Normal file
View File

@@ -0,0 +1,247 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
// This header file, and the implementation file executor_util.cc, contain
// various utility functions that are reused in many different operations
// (API requests) across Alternator's code - in files such as executor.cc,
// executor_read.cc, streams.cc, ttl.cc, and more. These utility functions
// include things like extracting and validating pieces from a JSON request,
// checking permissions, constructing auxiliary table names, and more.
#pragma once
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include <seastar/core/future.hh>
#include <seastar/util/noncopyable_function.hh>
#include "utils/rjson.hh"
#include "schema/schema_fwd.hh"
#include "types/types.hh"
#include "auth/permission.hh"
#include "alternator/stats.hh"
#include "alternator/attribute_path.hh"
#include "utils/managed_bytes.hh"
namespace query { class partition_slice; class result; }
namespace cql3::selection { class selection; }
namespace data_dictionary { class database; }
namespace service { class storage_proxy; class client_state; }
namespace alternator {
/// The body_writer is used for streaming responses - where the response body
/// is written in chunks to the output_stream. This allows for efficient
/// handling of large responses without needing to allocate a large buffer in
/// memory. It is one of the variants of executor::request_return_type.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
/// Get the value of an integer attribute, or an empty optional if it is
/// missing. If the attribute exists, but is not an integer, a descriptive
/// api_error is thrown.
std::optional<int> get_int_attribute(const rjson::value& value, std::string_view attribute_name);
/// Get the value of a string attribute, or a default value if it is missing.
/// If the attribute exists, but is not a string, a descriptive api_error is
/// thrown.
std::string get_string_attribute(const rjson::value& value, std::string_view attribute_name, const char* default_return);
/// Get the value of a boolean attribute, or a default value if it is missing.
/// If the attribute exists, but is not a bool, a descriptive api_error is
/// thrown.
bool get_bool_attribute(const rjson::value& value, std::string_view attribute_name, bool default_return);
/// Extract table name from a request.
/// Most requests expect the table's name to be listed in a "TableName" field.
/// get_table_name() returns the name or api_error in case the table name is
/// missing or not a string.
std::string get_table_name(const rjson::value& request);
/// find_table_name() is like get_table_name() except that it returns an
/// optional table name - it returns an empty optional when the TableName
/// is missing from the request, instead of throwing as get_table_name()
/// does. However, find_table_name() still throws if a TableName exists but
/// is not a string.
std::optional<std::string> find_table_name(const rjson::value& request);
/// Extract table schema from a request.
/// Many requests expect the table's name to be listed in a "TableName" field
/// and need to look it up as an existing table. The get_table() function
/// does this, with the appropriate validation and api_error in case the table
/// name is missing, invalid or the table doesn't exist. If everything is
/// successful, it returns the table's schema.
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
/// This find_table() variant is like get_table() excepts that it returns a
/// nullptr instead of throwing if the request does not mention a TableName.
/// In other cases of errors (i.e., a table is mentioned but doesn't exist)
/// this function throws too.
schema_ptr find_table(service::storage_proxy& proxy, const rjson::value& request);
/// This find_table() variant is like the previous one except that it takes
/// the table name directly instead of a request object. It is used in cases
/// where we already have the table name extracted from the request.
schema_ptr find_table(service::storage_proxy& proxy, std::string_view table_name);
// We would have liked to support table names up to 255 bytes, like DynamoDB.
// But Scylla creates a directory whose name is the table's name plus 33
// bytes (dash and UUID), and since directory names are limited to 255 bytes,
// we need to limit table names to 222 bytes, instead of 255. See issue #4480.
// We actually have two limits here,
// * max_table_name_length is the limit that Alternator will impose on names
// of new Alternator tables.
// * max_auxiliary_table_name_length is the potentially higher absolute limit
// that Scylla imposes on the names of auxiliary tables that Alternator
// wants to create internally - i.e. materialized views or CDC log tables.
// The second limit might mean that it is not possible to add a GSI to an
// existing table, because the name of the new auxiliary table may go over
// the limit. The second limit is also one of the reasons why the first limit
// is set lower than 222 - to have room to enable streams which add the extra
// suffix "_scylla_cdc_log" to the table name.
inline constexpr int max_table_name_length = 192;
inline constexpr int max_auxiliary_table_name_length = 222;
/// validate_table_name() validates the TableName parameter in a request - it
/// should be called in CreateTable, and in other requests only when noticing
/// that the named table doesn't exist.
/// The DynamoDB developer guide, https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.NamingRules
/// specifies that table "names must be between 3 and 255 characters long and
/// can contain only the following characters: a-z, A-Z, 0-9, _ (underscore),
/// - (dash), . (dot)". However, Alternator only allows max_table_name_length
/// characters (see above) - not 255.
/// validate_table_name() throws the appropriate api_error if this validation
/// fails.
void validate_table_name(std::string_view name, const char* source = "TableName");
/// Validate that a CDC log table could be created for the base table with a
/// given table_name, and if not, throw a user-visible api_error::validation.
/// It is not possible to create a CDC log table if the table name is so long
/// that adding the 15-character suffix "_scylla_cdc_log" (cdc_log_suffix)
/// makes it go over max_auxiliary_table_name_length.
/// Note that if max_table_name_length is set to less than 207 (which is
/// max_auxiliary_table_name_length-15), then this function will never
/// fail. However, it's still important to call it in UpdateTable, in case
/// we have pre-existing tables with names longer than this to avoid #24598.
void validate_cdc_log_name_length(std::string_view table_name);
/// Checks if a keyspace, given by its name, is an Alternator keyspace.
/// This just checks if the name begins in executor::KEYSPACE_NAME_PREFIX,
/// a prefix that all keyspaces created by Alternator's CreateTable use.
bool is_alternator_keyspace(std::string_view ks_name);
/// Wraps db::get_tags_of_table() and throws api_error::validation if the
/// table is missing the tags extension.
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);
/// Returns a type object representing the type of the ":attrs" column used
/// by Alternator to store all non-key attribute. This type is a map from
/// string (attribute name) to bytes (serialized attribute value).
map_type attrs_type();
// In DynamoDB index names are local to a table, while in Scylla, materialized
// view names are global (in a keyspace). So we need to compose a unique name
// for the view taking into account both the table's name and the index name.
// We concatenate the table and index name separated by a delim character
// (a character not allowed by DynamoDB in ordinary table names, default: ":").
// The downside of this approach is that it limits the sum of the lengths,
// instead of each component individually as DynamoDB does.
// The view_name() function assumes the table_name has already been validated
// but validates the legality of index_name and the combination of both.
std::string view_name(std::string_view table_name, std::string_view index_name,
const std::string& delim = ":", bool validate_len = true);
std::string gsi_name(std::string_view table_name, std::string_view index_name,
bool validate_len = true);
std::string lsi_name(std::string_view table_name, std::string_view index_name,
bool validate_len = true);
/// After calling pk_from_json() and ck_from_json() to extract the pk and ck
/// components of a key, and if that succeeded, call check_key() to further
/// check that the key doesn't have any spurious components.
void check_key(const rjson::value& key, const schema_ptr& schema);
/// Fail with api_error::validation if the expression if has unused attribute
/// names or values. This is how DynamoDB behaves, so we do too.
void verify_all_are_used(const rjson::value* field,
const std::unordered_set<std::string>& used,
const char* field_name,
const char* operation);
/// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
/// SELECT, DROP, etc.) on the given table. When permission is denied an
/// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, stats& stats);
/// Similar to verify_permission() above, but just for CREATE operations.
/// Those do not operate on any specific table, so require permissions on
/// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, stats& stats);
// Sets a KeySchema JSON array inside the given parent object describing the
// key attributes of the given schema as HASH or RANGE keys. Additionally,
// adds mappings from key attribute names to their DynamoDB type string into
// attribute_types.
void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string, std::string>* attribute_types = nullptr, const std::map<sstring, sstring>* tags = nullptr);
/// is_big() checks approximately if the given JSON value is "bigger" than
/// the given big_size number of bytes. The goal is to *quickly* detect
/// oversized JSON that, for example, is too large to be serialized to a
/// contiguous string - we don't need an accurate size for that. Moreover,
/// as soon as we detect that the JSON is indeed "big", we can return true
/// and don't need to continue calculating its exact size.
bool is_big(const rjson::value& val, int big_size = 100'000);
/// try_get_internal_table() handles the special case that the given table_name
/// begins with INTERNAL_TABLE_PREFIX (".scylla.alternator."). In that case,
/// this function assumes that the rest of the name refers to an internal
/// Scylla table (e.g., system table) and returns the schema of that table -
/// or an exception if it doesn't exist. Otherwise, if table_name does not
/// start with INTERNAL_TABLE_PREFIX, this function returns an empty schema_ptr
/// and the caller should look for a normal Alternator table with that name.
schema_ptr try_get_internal_table(const data_dictionary::database& db, std::string_view table_name);
/// get_table_from_batch_request() is used by batch write/read operations to
/// look up the schema for a table named in a batch request, by the JSON member
/// name (which is the table name in a BatchWriteItem or BatchGetItem request).
schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request);
/// Returns (or lazily creates) the per-table stats object for the given schema.
/// If the table has been deleted, returns a temporary stats object.
lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, const schema& schema);
/// Writes one item's attributes into `item` from the given selection result
/// row. If include_all_embedded_attributes is true, all attributes from the
/// ATTRS_COLUMN map column are included regardless of attrs_to_get.
void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
uint64_t* item_length_in_bytes = nullptr,
bool include_all_embedded_attributes = false);
/// Converts a single result row to a JSON item, or returns an empty optional
/// if the result is empty.
std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&,
uint64_t* item_length_in_bytes = nullptr);
/// Make a body_writer (function that can write output incrementally to the
/// HTTP stream) from the given JSON object.
/// Note: only useful for (very) large objects as there are overhead issues
/// with this as well, but for massive lists of return objects this can
/// help avoid large allocations/many re-allocs.
body_writer make_streamed(rjson::value&&);
} // namespace alternator

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "expressions.hh"
@@ -744,7 +744,7 @@ void validate_attr_name_length(std::string_view supplementary_context, size_t at
constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;
const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;
if (attr_name_length > max_length) {
if (attr_name_length > max_length || attr_name_length == 0) {
std::string error_msg;
if (!error_msg_prefix.empty()) {
error_msg += error_msg_prefix;
@@ -754,7 +754,11 @@ void validate_attr_name_length(std::string_view supplementary_context, size_t at
error_msg += supplementary_context;
error_msg += " - ";
}
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
if (attr_name_length == 0) {
error_msg += "Empty attribute name";
} else {
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
}
throw api_error::validation(error_msg);
}
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
/*

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -50,7 +50,7 @@ public:
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string(name)) {
void add_dot(std::string name) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
@@ -85,7 +85,7 @@ struct constant {
}
};
// "value" is is a value used in the right hand side of an assignment
// "value" is a value used in the right hand side of an assignment
// expression, "SET a = ...". It can be a constant (a reference to a value
// included in the request, e.g., ":val"), a path to an attribute from the
// existing item (e.g., "a.b[3].c"), or a function of other such values.
@@ -205,7 +205,7 @@ public:
// The supported primitive conditions are:
// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and
// v1 and v2 are values - from the item (an attribute path), the query
// (a ":val" reference), or a function of the the above (only the size()
// (a ":val" reference), or a function of the above (only the size()
// function is supported).
// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).
// 3. N-ary operator - v1 IN ( v2, v3, ... )

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -0,0 +1,299 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "alternator/http_compression.hh"
#include "alternator/server.hh"
#include <seastar/coroutine/maybe_yield.hh>
#include <zlib.h>
static logging::logger slogger("alternator-http-compression");
namespace alternator {
static constexpr size_t compressed_buffer_size = 1024;
class zlib_compressor {
z_stream _zs;
temporary_buffer<char> _output_buf;
noncopyable_function<future<>(temporary_buffer<char>&&)> _write_func;
public:
zlib_compressor(bool gzip, int compression_level, noncopyable_function<future<>(temporary_buffer<char>&&)> write_func)
: _write_func(std::move(write_func)) {
memset(&_zs, 0, sizeof(_zs));
if (deflateInit2(&_zs, std::clamp(compression_level, Z_NO_COMPRESSION, Z_BEST_COMPRESSION), Z_DEFLATED,
(gzip ? 16 : 0) + MAX_WBITS, 8, Z_DEFAULT_STRATEGY) != Z_OK) {
// Should only happen if memory allocation fails
throw std::bad_alloc();
}
}
~zlib_compressor() {
deflateEnd(&_zs);
}
future<> close() {
return compress(nullptr, 0, true);
}
future<> compress(const char* buf, size_t len, bool is_last_chunk = false) {
_zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(buf));
_zs.avail_in = (uInt) len;
int mode = is_last_chunk ? Z_FINISH : Z_NO_FLUSH;
while(_zs.avail_in > 0 || is_last_chunk) {
co_await coroutine::maybe_yield();
if (_output_buf.empty()) {
if (is_last_chunk) {
uint32_t max_buffer_size = 0;
deflatePending(&_zs, &max_buffer_size, nullptr);
max_buffer_size += deflateBound(&_zs, _zs.avail_in) + 1;
_output_buf = temporary_buffer<char>(std::min(compressed_buffer_size, (size_t) max_buffer_size));
} else {
_output_buf = temporary_buffer<char>(compressed_buffer_size);
}
_zs.next_out = reinterpret_cast<unsigned char*>(_output_buf.get_write());
_zs.avail_out = compressed_buffer_size;
}
int e = deflate(&_zs, mode);
if (e < Z_OK) {
throw api_error::internal("Error during compression of response body");
}
if (e == Z_STREAM_END || _zs.avail_out < compressed_buffer_size / 4) {
_output_buf.trim(compressed_buffer_size - _zs.avail_out);
co_await _write_func(std::move(_output_buf));
if (e == Z_STREAM_END) {
break;
}
}
}
}
};
// Helper string_view functions for parsing Accept-Encoding header
struct case_insensitive_cmp_sv {
bool operator()(std::string_view s1, std::string_view s2) const {
return std::equal(s1.begin(), s1.end(), s2.begin(), s2.end(),
[](char a, char b) { return ::tolower(a) == ::tolower(b); });
}
};
static inline std::string_view trim_left(std::string_view sv) {
while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.front())))
sv.remove_prefix(1);
return sv;
}
static inline std::string_view trim_right(std::string_view sv) {
while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.back())))
sv.remove_suffix(1);
return sv;
}
static inline std::string_view trim(std::string_view sv) {
return trim_left(trim_right(sv));
}
inline std::vector<std::string_view> split(std::string_view text, char separator) {
std::vector<std::string_view> tokens;
if (text == "") {
return tokens;
}
while (true) {
auto pos = text.find_first_of(separator);
if (pos != std::string_view::npos) {
tokens.emplace_back(text.data(), pos);
text.remove_prefix(pos + 1);
} else {
tokens.emplace_back(text);
break;
}
}
return tokens;
}
constexpr response_compressor::compression_type response_compressor::get_compression_type(std::string_view encoding) {
for (size_t i = 0; i < static_cast<size_t>(compression_type::count); ++i) {
if (case_insensitive_cmp_sv{}(encoding, compression_names[i])) {
return static_cast<compression_type>(i);
}
}
return compression_type::unknown;
}
response_compressor::compression_type response_compressor::find_compression(std::string_view accept_encoding, size_t response_size) {
std::optional<float> ct_q[static_cast<size_t>(compression_type::count)];
ct_q[static_cast<size_t>(compression_type::none)] = std::numeric_limits<float>::min(); // enabled, but lowest priority
compression_type selected_ct = compression_type::none;
std::vector<std::string_view> entries = split(accept_encoding, ',');
for (auto& e : entries) {
std::vector<std::string_view> params = split(e, ';');
if (params.size() == 0) {
continue;
}
compression_type ct = get_compression_type(trim(params[0]));
if (ct == compression_type::unknown) {
continue; // ignore unknown encoding types
}
if (ct_q[static_cast<size_t>(ct)].has_value() && ct_q[static_cast<size_t>(ct)] != 0.0f) {
continue; // already processed this encoding
}
if (response_size < _threshold[static_cast<size_t>(ct)]) {
continue; // below threshold treat as unknown
}
for (size_t i = 1; i < params.size(); ++i) { // find "q=" parameter
auto pos = params[i].find("q=");
if (pos == std::string_view::npos) {
continue;
}
std::string_view param = params[i].substr(pos + 2);
param = trim(param);
// parse quality value
float q_value = 1.0f;
auto [ptr, ec] = std::from_chars(param.data(), param.data() + param.size(), q_value);
if (ec != std::errc() || ptr != param.data() + param.size()) {
continue;
}
if (q_value < 0.0) {
q_value = 0.0;
} else if (q_value > 1.0) {
q_value = 1.0;
}
ct_q[static_cast<size_t>(ct)] = q_value;
break; // we parsed quality value
}
if (!ct_q[static_cast<size_t>(ct)].has_value()) {
ct_q[static_cast<size_t>(ct)] = 1.0f; // default quality value
}
// keep the highest encoding (in the order, unless 'any')
if (selected_ct == compression_type::any) {
if (ct_q[static_cast<size_t>(ct)] >= ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = ct;
}
} else {
if (ct_q[static_cast<size_t>(ct)] > ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = ct;
}
}
}
if (selected_ct == compression_type::any) {
// select any not mentioned or highest quality
selected_ct = compression_type::none;
for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {
if (!ct_q[i].has_value()) {
return static_cast<compression_type>(i);
}
if (ct_q[i] > ct_q[static_cast<size_t>(selected_ct)]) {
selected_ct = static_cast<compression_type>(i);
}
}
}
return selected_ct;
}
static future<chunked_content> compress(response_compressor::compression_type ct, const db::config& cfg, std::string str) {
chunked_content compressed;
auto write = [&compressed](temporary_buffer<char>&& buf) -> future<> {
compressed.push_back(std::move(buf));
return make_ready_future<>();
};
zlib_compressor compressor(ct != response_compressor::compression_type::deflate,
cfg.alternator_response_gzip_compression_level(), std::move(write));
co_await compressor.compress(str.data(), str.size(), true);
co_return compressed;
}
static sstring flatten(chunked_content&& cc) {
size_t total_size = 0;
for (const auto& chunk : cc) {
total_size += chunk.size();
}
sstring result = sstring{ sstring::initialized_later{}, total_size };
size_t offset = 0;
for (const auto& chunk : cc) {
std::copy(chunk.begin(), chunk.end(), result.begin() + offset);
offset += chunk.size();
}
return result;
}
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, std::optional<std::string_view> content_type, std::string&& response_body) {
response_compressor::compression_type ct = find_compression(accept_encoding, response_body.size());
if (ct != response_compressor::compression_type::none) {
rep->add_header("Content-Encoding", get_encoding_name(ct));
return compress(ct, cfg, std::move(response_body)).then([rep = std::move(rep), content_type] (chunked_content compressed) mutable {
rep->write_body(content_type, flatten(std::move(compressed)));
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
});
} else {
// Note that despite the move, response_body (std::string) is copied
// into an sstring when passed to write_body().
rep->write_body(content_type, std::move(response_body));
}
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
}
template<typename Compressor>
class compressed_data_sink_impl : public data_sink_impl {
output_stream<char> _out;
Compressor _compressor;
public:
template<typename... Args>
compressed_data_sink_impl(output_stream<char>&& out, Args&&... args)
: _out(std::move(out)), _compressor(std::forward<Args>(args)..., [this](temporary_buffer<char>&& buf) {
return _out.write(std::move(buf));
}) { }
future<> put(std::span<temporary_buffer<char>> data) override {
return data_sink_impl::fallback_put(data, [this] (temporary_buffer<char>&& buf) {
return do_put(std::move(buf));
});
}
private:
future<> do_put(temporary_buffer<char> buf) {
co_return co_await _compressor.compress(buf.get(), buf.size());
}
future<> close() override {
return _compressor.close().then([this] {
return _out.close();
});
}
};
body_writer compress(response_compressor::compression_type ct, const db::config& cfg, body_writer&& bw) {
return [bw = std::move(bw), ct, level = cfg.alternator_response_gzip_compression_level()](output_stream<char>&& out) mutable -> future<> {
output_stream_options opts;
opts.trim_to_size = true;
std::unique_ptr<data_sink_impl> data_sink_impl;
switch (ct) {
case response_compressor::compression_type::gzip:
data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), true, level);
break;
case response_compressor::compression_type::deflate:
data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), false, level);
break;
case response_compressor::compression_type::none:
case response_compressor::compression_type::any:
case response_compressor::compression_type::unknown:
on_internal_error(slogger,"Compression not selected");
default:
on_internal_error(slogger, "Unsupported compression type for data sink");
}
return bw(output_stream<char>(data_sink(std::move(data_sink_impl)), compressed_buffer_size, opts));
};
}
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, std::optional<std::string_view> content_type, body_writer&& body_writer) {
response_compressor::compression_type ct = find_compression(accept_encoding, std::numeric_limits<size_t>::max());
if (ct != response_compressor::compression_type::none) {
rep->add_header("Content-Encoding", get_encoding_name(ct));
rep->write_body(content_type, compress(ct, cfg, std::move(body_writer)));
} else {
rep->write_body(content_type, std::move(body_writer));
}
return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));
}
} // namespace alternator

View File

@@ -0,0 +1,91 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "alternator/executor.hh"
#include <seastar/http/httpd.hh>
#include "db/config.hh"
namespace alternator {
class response_compressor {
public:
enum class compression_type {
gzip,
deflate,
compressions_count,
any = compressions_count,
none,
count,
unknown = count
};
static constexpr std::string_view compression_names[] = {
"gzip",
"deflate",
"*",
"identity"
};
static sstring get_encoding_name(compression_type ct) {
return sstring(compression_names[static_cast<size_t>(ct)]);
}
static constexpr compression_type get_compression_type(std::string_view encoding);
sstring get_accepted_encoding(const http::request& req) {
if (get_threshold() == 0) {
return "";
}
return req.get_header("Accept-Encoding");
}
compression_type find_compression(std::string_view accept_encoding, size_t response_size);
response_compressor(const db::config& cfg)
: cfg(cfg)
,_gzip_level_observer(
cfg.alternator_response_gzip_compression_level.observe([this](int v) {
update_threshold();
}))
,_gzip_threshold_observer(
cfg.alternator_response_compression_threshold_in_bytes.observe([this](uint32_t v) {
update_threshold();
}))
{
update_threshold();
}
response_compressor(const response_compressor& rhs) : response_compressor(rhs.cfg) {}
private:
const db::config& cfg;
utils::observable<int>::observer _gzip_level_observer;
utils::observable<uint32_t>::observer _gzip_threshold_observer;
uint32_t _threshold[static_cast<size_t>(compression_type::count)];
size_t get_threshold() { return _threshold[static_cast<size_t>(compression_type::any)]; }
void update_threshold() {
_threshold[static_cast<size_t>(compression_type::none)] = std::numeric_limits<uint32_t>::max();
_threshold[static_cast<size_t>(compression_type::any)] = std::numeric_limits<uint32_t>::max();
uint32_t gzip = cfg.alternator_response_gzip_compression_level() <= 0 ? std::numeric_limits<uint32_t>::max()
: cfg.alternator_response_compression_threshold_in_bytes();
_threshold[static_cast<size_t>(compression_type::gzip)] = gzip;
_threshold[static_cast<size_t>(compression_type::deflate)] = gzip;
for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {
if (_threshold[i] < _threshold[static_cast<size_t>(compression_type::any)]) {
_threshold[static_cast<size_t>(compression_type::any)] = _threshold[i];
}
}
}
public:
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, std::optional<std::string_view> content_type, std::string&& response_body);
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, std::optional<std::string_view> content_type, body_writer&& body_writer);
};
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "expressions.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "utils/base64.hh"
@@ -14,12 +14,12 @@
#include "types/concrete_types.hh"
#include "types/json_utils.hh"
#include "mutation/position_in_partition.hh"
#include "alternator/executor_util.hh"
static logging::logger slogger("alternator-serialization");
namespace alternator {
bool is_alternator_keyspace(const sstring& ks_name);
type_info type_info_from_string(std::string_view type) {
static thread_local const std::unordered_map<std::string_view, type_info> type_infos = {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -55,7 +55,7 @@ partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it. Otherwise,
// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

View File

@@ -3,10 +3,12 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "alternator/server.hh"
#include "audit/audit.hh"
#include "alternator/executor_util.hh"
#include "gms/application_state.hh"
#include "utils/log.hh"
#include <fmt/ranges.h>
@@ -26,6 +28,7 @@
#include "auth.hh"
#include <cctype>
#include <string_view>
#include <algorithm>
#include <utility>
#include "service/storage_proxy.hh"
#include "gms/gossiper.hh"
@@ -34,6 +37,7 @@
#include "client_data.hh"
#include "utils/updateable_value.hh"
#include <zlib.h>
#include "alternator/http_compression.hh"
static logging::logger slogger("alternator-server");
@@ -109,11 +113,22 @@ class api_handler : public handler_base {
// "application/json". Some other AWS services use later versions instead
// of "1.0", but DynamoDB currently uses "1.0". Note that this content
// type applies to all replies, both success and error.
static constexpr const char* REPLY_CONTENT_TYPE = "application/x-amz-json-1.0";
static constexpr std::string_view REPLY_CONTENT_TYPE = "application/x-amz-json-1.0";
public:
api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle) : _f_handle(
api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle,
const db::config& config) :
_content_type(config.alternator_http_response_disable_content_type_header()
? std::nullopt
: std::optional<std::string_view>(REPLY_CONTENT_TYPE)),
_content_type_observer(config.alternator_http_response_disable_content_type_header.observe(
[this](const bool& ct) {
_content_type = ct ? std::nullopt : std::optional<std::string_view>(REPLY_CONTENT_TYPE);
})),
_response_compressor(config), _f_handle(
[this, _handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped([this, rep = std::move(rep)](future<executor::request_return_type> resf) mutable {
sstring accept_encoding = _response_compressor.get_accepted_encoding(*req);
return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped(
[this, rep = std::move(rep), accept_encoding=std::move(accept_encoding)](future<executor::request_return_type> resf) mutable {
if (resf.failed()) {
// Exceptions of type api_error are wrapped as JSON and
// returned to the client as expected. Other types of
@@ -133,37 +148,35 @@ public:
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
auto res = resf.get();
std::visit(overloaded_functor {
return std::visit(overloaded_functor {
[&] (std::string&& str) {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(str);
rep->set_content_type(REPLY_CONTENT_TYPE);
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
_content_type, std::move(str));
},
[&] (executor::body_writer&& body_writer) {
rep->write_body(REPLY_CONTENT_TYPE, std::move(body_writer));
[&] (body_writer&& body_writer) {
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
_content_type, std::move(body_writer));
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
}, std::move(res));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}) { }
api_handler(const api_handler&) = default;
api_handler(const api_handler&) = delete;
future<std::unique_ptr<reply>> handle(const sstring& path,
std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
return _f_handle(std::move(req), std::move(rep)).then(
[](std::unique_ptr<reply> rep) {
rep->done();
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
return _f_handle(std::move(req), std::move(rep));
}
protected:
std::optional<std::string_view> _content_type;
utils::observer<bool> _content_type_observer;
response_compressor _response_compressor;
future_handler_function _f_handle;
void generate_error_reply(reply& rep, const api_error& err) {
rjson::value results = rjson::empty_object();
if (!err._extra_fields.IsNull() && err._extra_fields.IsObject()) {
@@ -171,13 +184,11 @@ protected:
}
rjson::add(results, "__type", rjson::from_string("com.amazonaws.dynamodb.v20120810#" + err._type));
rjson::add(results, "message", err._msg);
rep._content = rjson::print(std::move(results));
rep._status = err._http_code;
rep.set_content_type(REPLY_CONTENT_TYPE);
slogger.trace("api_handler error case: {}", rep._content);
sstring content = rjson::print(std::move(results));
slogger.trace("api_handler error case: {}", content);
rep.set_status(err._http_code);
rep.write_body(_content_type, std::move(content));
}
future_handler_function _f_handle;
};
class gated_handler : public handler_base {
@@ -251,8 +262,7 @@ protected:
}
}
rep->set_status(reply::status_type::ok);
rep->set_content_type("json");
rep->_content = rjson::print(results);
rep->write_body("json", rjson::print(results));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
@@ -371,18 +381,45 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
for (const auto& header : signed_headers) {
signed_headers_map.emplace(header, std::string_view());
}
std::vector<std::string> modified_values;
for (auto& header : req._headers) {
std::string header_str;
header_str.resize(header.first.size());
std::transform(header.first.begin(), header.first.end(), header_str.begin(), ::tolower);
auto it = signed_headers_map.find(header_str);
if (it != signed_headers_map.end()) {
it->second = std::string_view(header.second);
// replace multiple spaces in the header value header.second with
// a single space, as required by AWS SigV4 header canonization.
// If we modify the value, we need to save it in modified_values
// to keep it alive.
std::string value;
value.reserve(header.second.size());
bool prev_space = false;
bool modified = false;
for (char ch : header.second) {
if (ch == ' ') {
if (!prev_space) {
value += ch;
prev_space = true;
} else {
modified = true; // skip a space
}
} else {
value += ch;
prev_space = false;
}
}
if (modified) {
modified_values.emplace_back(std::move(value));
it->second = std::string_view(modified_values.back());
} else {
it->second = std::string_view(header.second);
}
}
}
auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
return get_key_from_roles(proxy, as, std::move(username));
auto cache_getter = [&proxy = _proxy] (std::string username) {
return get_key_from_roles(proxy, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
user = std::move(user),
@@ -390,6 +427,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
datestamp = std::move(datestamp),
signed_headers_str = std::move(signed_headers_str),
signed_headers_map = std::move(signed_headers_map),
modified_values = std::move(modified_values),
region = std::move(region),
service = std::move(service),
user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {
@@ -560,11 +598,11 @@ read_entire_stream(input_stream<char>& inp, size_t length_limit) {
class safe_gzip_zstream {
z_stream _zs;
public:
safe_gzip_zstream() {
// If gzip is true, decode a gzip header (for "Content-Encoding: gzip").
// Otherwise, a zlib header (for "Content-Encoding: deflate").
safe_gzip_zstream(bool gzip = true) {
memset(&_zs, 0, sizeof(_zs));
// The strange 16 + WMAX_BITS tells zlib to expect and decode
// a gzip header, not a zlib header.
if (inflateInit2(&_zs, 16 + MAX_WBITS) != Z_OK) {
if (inflateInit2(&_zs, gzip ? 16 + MAX_WBITS : MAX_WBITS) != Z_OK) {
// Should only happen if memory allocation fails
throw std::bad_alloc();
}
@@ -583,19 +621,21 @@ public:
}
};
// ungzip() takes a chunked_content with a gzip-compressed request body,
// uncompresses it, and returns the uncompressed content as a chunked_content.
// ungzip() takes a chunked_content of a compressed request body, and returns
// the uncompressed content as a chunked_content. If gzip is true, we expect
// gzip header (for "Content-Encoding: gzip"), if gzip is false, we expect a
// zlib header (for "Content-Encoding: deflate").
// If the uncompressed content exceeds length_limit, an error is thrown.
static future<chunked_content>
ungzip(chunked_content&& compressed_body, size_t length_limit) {
ungzip(chunked_content&& compressed_body, size_t length_limit, bool gzip = true) {
chunked_content ret;
// output_buf can be any size - when uncompressing input_buf, it doesn't
// need to fit in a single output_buf, we'll use multiple output_buf for
// a single input_buf if needed.
constexpr size_t OUTPUT_BUF_SIZE = 4096;
temporary_buffer<char> output_buf;
safe_gzip_zstream strm;
bool complete_stream = false; // empty input is not a valid gzip
safe_gzip_zstream strm(gzip);
bool complete_stream = false; // empty input is not a valid gzip/deflate
size_t total_out_bytes = 0;
for (const temporary_buffer<char>& input_buf : compressed_body) {
if (input_buf.empty()) {
@@ -666,6 +706,17 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
// for such a size.
co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));
}
// Check the concurrency limit early, before acquiring memory and
// reading the request body, to avoid piling up memory from excess
// requests that will be rejected anyway. This mirrors the CQL
// transport which also checks concurrency before memory acquisition
// (transport/server.cc).
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// If the Content-Length of the request is not available, we assume
@@ -677,7 +728,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
SCYLLA_ASSERT(req->content_stream);
throwing_assert(req->content_stream);
chunked_content content = co_await read_entire_stream(*req->content_stream, request_content_length_limit);
// If the request had no Content-Length, we reserved too many units
// so need to return some
@@ -698,6 +749,8 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
sstring content_encoding = req->get_header("Content-Encoding");
if (content_encoding == "gzip") {
content = co_await ungzip(std::move(content), request_content_length_limit);
} else if (content_encoding == "deflate") {
content = co_await ungzip(std::move(content), request_content_length_limit, false);
} else if (!content_encoding.empty()) {
// DynamoDB returns a 500 error for unsupported Content-Encoding.
// I'm not sure if this is the best error code, but let's do it too.
@@ -708,8 +761,12 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
// As long as the system_clients_entry object is alive, this request will
// be visible in the "system.clients" virtual table. When requested, this
// entry will be formatted by server::ongoing_request::make_client_data().
auto user_agent_header = co_await _connection_options_keys_and_values.get_or_load(req->get_header("User-Agent"), [] (const client_options_cache_key_type&) {
return make_ready_future<options_cache_value_type>(options_cache_value_type{});
});
auto system_clients_entry = _ongoing_requests.emplace(
req->get_client_address(), req->get_header("User-Agent"),
req->get_client_address(), std::move(user_agent_header),
username, current_scheduling_group(),
req->get_protocol_name() == "https");
@@ -721,18 +778,12 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(fmt::format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
executor::client_state client_state(service::client_state::external_tag(),
_auth_service, &_sl_controller, _timeout_config.current_values(), req->get_client_address());
if (!username.empty()) {
client_state.set_login(auth::authenticated_user(username));
}
co_await client_state.maybe_update_per_service_level_params();
client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());
tracing::trace(trace_state, "{}", op);
@@ -741,12 +792,25 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
auto f = [this, content = std::move(content), &callback = callback_it->second,
client_state = std::move(client_state), trace_state = std::move(trace_state),
units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {
rjson::value json_request = co_await _json_parser.parse(std::move(content));
if (!json_request.IsObject()) {
co_return api_error::validation("Request content must be an object");
}
co_return co_await callback(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
rjson::value json_request = co_await _json_parser.parse(std::move(content));
if (!json_request.IsObject()) {
co_return api_error::validation("Request content must be an object");
}
std::unique_ptr<audit::audit_info_alternator> audit_info;
std::exception_ptr ex = {};
executor::request_return_type ret;
try {
ret = co_await callback(_executor, client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req), audit_info);
} catch (...) {
ex = std::current_exception();
}
if (audit_info) {
co_await audit::inspect(*audit_info, client_state, ex != nullptr);
}
if (ex) {
co_return coroutine::exception(std::move(ex));
}
co_return ret;
};
co_return co_await _sl_controller.with_user_service_level(user, std::ref(f));
}
@@ -754,7 +818,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
void server::set_routes(routes& r) {
api_handler* req_handler = new api_handler([this] (std::unique_ptr<request> req) mutable {
return handle_api_request(std::move(req));
});
}, _proxy.data_dictionary().get_config());
r.put(operation_type::POST, "/", req_handler);
r.put(operation_type::GET, "/", new health_handler(_pending_requests));
@@ -790,82 +854,99 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DescribeTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"UpdateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"UpdateItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"GetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"GetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DeleteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DeleteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.delete_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"ListTables", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tables(client_state, std::move(permit), std::move(json_request));
{"ListTables", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_tables(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"Scan", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.scan(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"Scan", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.scan(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_endpoints(client_state, std::move(permit), std::move(json_request), req->get_header("Host"));
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_endpoints(client_state, std::move(permit), std::move(json_request), req->get_header("Host"), audit_info);
}},
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_write_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.batch_write_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.batch_get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"Query", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.query(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"Query", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.query(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"TagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.tag_resource(client_state, std::move(permit), std::move(json_request));
{"TagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.tag_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"UntagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.untag_resource(client_state, std::move(permit), std::move(json_request));
{"UntagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.untag_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_time_to_live(client_state, std::move(permit), std::move(json_request));
{"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_time_to_live(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request));
{"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_streams(client_state, std::move(permit), std::move(json_request));
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_streams(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeStream", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_stream(client_state, std::move(permit), std::move(json_request));
{"DescribeStream", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_stream(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"GetShardIterator", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_shard_iterator(client_state, std::move(permit), std::move(json_request));
{"GetShardIterator", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_shard_iterator(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request));
{"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request), audit_info);
}},
} {
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
// Sanitize an HTTP header value: strip control characters (RFC 7230 §3.2.6)
// and leading/trailing whitespace. Returns nullopt if the result is empty.
static std::optional<sstring> sanitize_header_value(const sstring& v, std::string_view option_name) {
std::string sanitized(v.begin(), v.end());
sanitized.erase(std::remove_if(sanitized.begin(), sanitized.end(),
[](unsigned char c) { return std::iscntrl(c); }), sanitized.end());
if (sanitized.size() != v.size()) {
slogger.warn("Configuration option '{}' contained control characters, they were stripped", option_name);
}
std::string_view trimmed = sanitized;
while (!trimmed.empty() && std::isspace((unsigned char)trimmed.front())) trimmed.remove_prefix(1);
while (!trimmed.empty() && std::isspace((unsigned char)trimmed.back())) trimmed.remove_suffix(1);
return trimmed.empty() ? std::nullopt : std::optional<sstring>(trimmed);
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
@@ -873,20 +954,46 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
_warn_authorization = std::move(warn_authorization);
_max_concurrent_requests = std::move(max_concurrent_requests);
_max_users_query_size_in_trace_output = std::move(max_users_query_size_in_trace_output);
if (!port && !https_port) {
if (!port && !https_port && !port_proxy_protocol && !https_port_proxy_protocol) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
}
return seastar::async([this, addr, port, https_port, creds] {
return seastar::async([this, addr, port, https_port, port_proxy_protocol, https_port_proxy_protocol, creds] {
_executor.start().get();
if (port) {
// Apply current config values and register observers for live updates
// before listen() so that no responses are ever sent with stale defaults.
// Both options drive Seastar's built-in header generation directly.
const db::config& cfg = _proxy.data_dictionary().get_config();
auto apply_server_header = [this] (const sstring& v) {
auto opt = sanitize_header_value(v, "alternator_http_response_server_header");
_http_server.set_server_header(opt);
_https_server.set_server_header(opt);
};
auto apply_date_header = [this] (const bool& disable) {
_http_server.set_generate_date_header(!disable);
_https_server.set_generate_date_header(!disable);
};
apply_server_header(cfg.alternator_http_response_server_header());
apply_date_header(cfg.alternator_http_response_disable_date_header());
_server_header_observer = cfg.alternator_http_response_server_header.observe(std::move(apply_server_header));
_date_header_observer = cfg.alternator_http_response_disable_date_header.observe(std::move(apply_date_header));
if (port || port_proxy_protocol) {
set_routes(_http_server._routes);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
if (port) {
_http_server.listen(socket_address{addr, *port}).get();
}
if (port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_http_server.listen(socket_address{addr, *port_proxy_protocol}, lo).get();
}
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
if (https_port || https_port_proxy_protocol) {
set_routes(_https_server._routes);
_https_server.set_content_streaming(true);
@@ -906,7 +1013,15 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
} else {
_credentials = creds->build_server_credentials();
}
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
if (https_port) {
_https_server.listen(socket_address{addr, *https_port}, _credentials).get();
}
if (https_port_proxy_protocol) {
listen_options lo;
lo.reuse_address = true;
lo.proxy_protocol = true;
_https_server.listen(socket_address{addr, *https_port_proxy_protocol}, lo, _credentials).get();
}
_enabled_servers.push_back(std::ref(_https_server));
}
});
@@ -979,16 +1094,15 @@ client_data server::ongoing_request::make_client_data() const {
// and keep "driver_version" unset.
cd.driver_name = _user_agent;
// Leave "protocol_version" unset, it has no meaning in Alternator.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.
// As reported in issue #9216, we never set these fields in CQL
// either (see cql_server::connection::make_client_data()).
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset for Alternator.
// Note: CQL sets ssl_protocol and ssl_cipher_suite via generic_server::connection base class.
return cd;
}
future<utils::chunked_vector<client_data>> server::get_client_data() {
utils::chunked_vector<client_data> ret;
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> server::get_client_data() {
utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>> ret;
co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {
ret.emplace_back(r.make_client_data());
ret.emplace_back(make_foreign(std::make_unique<client_data>(r.make_client_data())));
});
co_return ret;
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -34,7 +34,7 @@ class server : public peering_sharded_service<server> {
// DynamoDB also has the same limit set to 16 MB.
static constexpr size_t request_content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>, std::unique_ptr<audit::audit_info_alternator>&)>;
using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;
httpd::http_server _http_server;
@@ -55,6 +55,7 @@ class server : public peering_sharded_service<server> {
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
client_options_cache_type _connection_options_keys_and_values;
alternator_callbacks_map _callbacks;
@@ -88,7 +89,7 @@ class server : public peering_sharded_service<server> {
// is called when reading the "system.clients" virtual table.
struct ongoing_request {
socket_address _client_address;
sstring _user_agent;
client_options_cache_entry_type _user_agent;
sstring _username;
scheduling_group _scheduling_group;
bool _is_https;
@@ -96,10 +97,16 @@ class server : public peering_sharded_service<server> {
};
utils::scoped_item_list<ongoing_request> _ongoing_requests;
// Observers for live-update config options that drive Seastar HTTP server state.
std::optional<utils::observer<sstring>> _server_header_observer;
std::optional<utils::observer<bool>> _date_header_observer;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,
std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
@@ -107,7 +114,7 @@ public:
// table "system.clients" is read. It is expected to generate a list of
// clients connected to this server (on this shard). This function is
// called by alternator::controller::get_client_data().
future<utils::chunked_vector<client_data>> get_client_data();
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data();
private:
void set_routes(seastar::httpd::routes& r);
// If verification succeeds, returns the authenticated user's username

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "stats.hh"
@@ -14,20 +14,6 @@
namespace alternator {
const char* ALTERNATOR_METRICS = "alternator";
static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {
seastar::metrics::histogram res;
res.buckets.resize(histogram.bucket_offsets.size());
uint64_t cumulative_count = 0;
res.sample_count = histogram._count;
res.sample_sum = histogram._sample_sum;
for (size_t i = 0; i < res.buckets.size(); i++) {
auto& v = res.buckets[i];
v.upper_bound = histogram.bucket_offsets[i];
cumulative_count += histogram.buckets[i];
v.count = cumulative_count;
}
return res;
}
static seastar::metrics::label column_family_label("cf");
static seastar::metrics::label keyspace_label("ks");
@@ -151,21 +137,21 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,
stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
seastar::metrics::label expression_label("expression");

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -16,6 +16,8 @@
#include "cql3/stats.hh"
namespace alternator {
using batch_histogram = utils::estimated_histogram_with_max<128>;
using op_size_histogram = utils::estimated_histogram_with_max<512>;
// Object holding per-shard statistics related to Alternator.
// While this object is alive, these metrics are also registered to be
@@ -76,34 +78,34 @@ public:
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
batch_histogram batch_get_item_histogram;
batch_histogram batch_write_item_histogram;
} api_operations;
// Operation size metrics
struct {
// Item size statistics collected per table and aggregated per node.
// Each histogram covers the range 0 - 446. Resolves #25143.
// Each histogram covers the range 0 - 512. Resolves #25143.
// A size is the retrieved item's size.
utils::estimated_histogram get_item_op_size_kb{30};
op_size_histogram get_item_op_size_kb;
// A size is the maximum of the new item's size and the old item's size.
utils::estimated_histogram put_item_op_size_kb{30};
op_size_histogram put_item_op_size_kb;
// A size is the deleted item's size. If the deleted item's size is
// unknown (i.e. read-before-write wasn't necessary and it wasn't
// forced by a configuration option), it won't be recorded on the
// histogram.
utils::estimated_histogram delete_item_op_size_kb{30};
op_size_histogram delete_item_op_size_kb;
// A size is the maximum of existing item's size and the estimated size
// of the update. This will be changed to the maximum of the existing item's
// size and the new item's size in a subsequent PR.
utils::estimated_histogram update_item_op_size_kb{30};
op_size_histogram update_item_op_size_kb;
// A size is the sum of the sizes of all items per table. This means
// that a single BatchGetItem / BatchWriteItem updates the histogram
// for each table that it has items in.
// The sizes are the retrieved items' sizes grouped per table.
utils::estimated_histogram batch_get_item_op_size_kb{30};
op_size_histogram batch_get_item_op_size_kb;
// The sizes are the the written items' sizes grouped per table.
utils::estimated_histogram batch_write_item_op_size_kb{30};
op_size_histogram batch_write_item_op_size_kb;
} operation_sizes;
// Count of authentication and authorization failures, counted if either
// alternator_enforce_authorization or alternator_warn_authorization are
@@ -140,7 +142,7 @@ public:
cql3::cql_stats cql_stats;
// Enumeration of expression types only for stats
// if needed it can be extended e.g. per operation
// if needed it can be extended e.g. per operation
enum expression_types {
UPDATE_EXPRESSION,
CONDITION_EXPRESSION,
@@ -164,7 +166,7 @@ struct table_stats {
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);
inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {
return (bytes + 1023) / 1024;
return (bytes) / 1024;
}
}

File diff suppressed because it is too large Load Diff

62
alternator/streams.hh Normal file
View File

@@ -0,0 +1,62 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "utils/chunked_vector.hh"
#include "cdc/generation.hh"
#include <generator>
namespace cdc {
class stream_id;
}
namespace alternator {
class stream_id_range {
// helper class for manipulating (possibly wrapped around) range of stream_ids
// it holds one or two ranges [lo1, end1) and [lo2, end2)
// if the range doesn't wrap around, then lo2 == end2 == items.end()
// if the range wraps around, then
// `lo1 == items.begin() and end2 == items.end()` must be true
// the object doesn't own `items`, but it does manipulate it - it will
// reorder elements (so both ranges were next to each other) and sort them by unsigned comparison
// usage - create an object with needed ranges. before iteration call `prepare_for_iterating` method -
// it will reorder elements of `items` array to what is needed and then call begin / end pair.
// note - `items` array will be modified - elements will be reordered, but no elements will be added or removed.
// `items` array must stay intact as long as iteration is in progress.
utils::chunked_vector<cdc::stream_id>::iterator _lo1 = {}, _end1 = {}, _lo2 = {}, _end2 = {};
const cdc::stream_id* _skip_to = nullptr;
bool _prepared = false;
public:
stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1);
stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1,
utils::chunked_vector<cdc::stream_id>::iterator lo2,
utils::chunked_vector<cdc::stream_id>::iterator end2);
void set_starting_position(const cdc::stream_id &update_to);
// Must be called after construction and after set_starting_position()
// (if used), but before begin()/end() iteration.
void prepare_for_iterating();
utils::chunked_vector<cdc::stream_id>::iterator begin() const { return _lo1; }
utils::chunked_vector<cdc::stream_id>::iterator end() const { return _end1; }
};
stream_id_range find_children_range_from_parent_token(
const utils::chunked_vector<cdc::stream_id>& parent_streams,
utils::chunked_vector<cdc::stream_id>& current_streams,
cdc::stream_id parent,
bool uses_tablets
);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include <chrono>
@@ -44,8 +44,10 @@
#include "cql3/query_options.hh"
#include "cql3/column_identifier.hh"
#include "alternator/executor.hh"
#include "alternator/executor_util.hh"
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "alternator/ttl_tag.hh"
#include "dht/sharder.hh"
#include "db/config.hh"
#include "db/tags/utils.hh"
@@ -57,22 +59,17 @@ static logging::logger tlogger("alternator_ttl");
namespace alternator {
// We write the expiration-time attribute enabled on a table in a
// tag TTL_TAG_KEY.
// Currently, the *value* of this tag is simply the name of the attribute,
// and the expiration scanner interprets it as an Alternator attribute name -
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column. Although this is designed for Alternator, it may
// be good enough for CQL as well (there, the ":attrs" column won't exist).
extern const sstring TTL_TAG_KEY;
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
}
schema_ptr schema = get_table(_proxy, request);
maybe_audit(audit_info, audit::statement_category::DDL,
schema->ks_name(), schema->cf_name(), "UpdateTimeToLive", request);
rjson::value* spec = rjson::find(request, "TimeToLiveSpecification");
if (!spec || !spec->IsObject()) {
co_return api_error::validation("UpdateTimeToLive missing mandatory TimeToLiveSpecification");
@@ -122,9 +119,13 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.describe_time_to_live++;
schema_ptr schema = get_table(_proxy, request);
maybe_audit(audit_info, audit::statement_category::QUERY,
schema->ks_name(), schema->cf_name(), "DescribeTimeToLive", request);
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
rjson::value desc = rjson::empty_object();
auto i = tags_map.find(TTL_TAG_KEY);
@@ -141,7 +142,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLive request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -324,9 +325,7 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se
const auto& tm = *erm->get_token_metadata_ptr();
const auto& sorted_tokens = tm.sorted_tokens();
std::vector<std::pair<dht::token_range, locator::host_id>> ret;
if (sorted_tokens.empty()) {
on_internal_error(tlogger, "Token metadata is empty");
}
throwing_assert(!sorted_tokens.empty());
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
@@ -563,7 +562,7 @@ static future<> scan_table_ranges(
expiration_service::stats& expiration_stats)
{
const schema_ptr& s = scan_ctx.s;
SCYLLA_ASSERT (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
throwing_assert(partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,
*scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);
while (!p->is_exhausted()) {
@@ -593,7 +592,7 @@ static future<> scan_table_ranges(
if (retries >= 10) {
// Don't get stuck forever asking the same page, maybe there's
// a bug or a real problem in several replicas. Give up on
// this scan an retry the scan from a random position later,
// this scan and retry the scan from a random position later,
// in the next scan period.
throw runtime_exception("scanner thread failed after too many timeouts for the same page");
}
@@ -640,13 +639,38 @@ static future<> scan_table_ranges(
}
} else {
// For a real column to contain an expiration time, it
// must be a numeric type.
// FIXME: Currently we only support decimal_type (which is
// what Alternator uses), but other numeric types can be
// supported as well to make this feature more useful in CQL.
// Note that kind::decimal is also checked above.
big_decimal n = value_cast<big_decimal>(v);
expired = is_expired(n, now);
// must be a numeric type. We currently support decimal
// (used by Alternator TTL) as well as bigint, int and
// timestamp (used by CQL per-row TTL).
switch (meta[*expiration_column]->type->get_kind()) {
case abstract_type::kind::decimal:
// Used by Alternator TTL for key columns not stored
// in the map. The value is in seconds, fractional
// part is ignored.
expired = is_expired(value_cast<big_decimal>(v), now);
break;
case abstract_type::kind::long_kind:
// Used by CQL per-row TTL. The value is in seconds.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int64_t>(v))), now);
break;
case abstract_type::kind::int32:
// Used by CQL per-row TTL. The value is in seconds.
// Using int type is not recommended because it will
// overflow in 2038, but we support it to allow users
// to use existing int columns for expiration.
expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int32_t>(v))), now);
break;
case abstract_type::kind::timestamp:
// Used by CQL per-row TTL. The value is in milliseconds
// but we truncate it to gc_clock's precision (whole seconds).
expired = is_expired(gc_clock::time_point(std::chrono::duration_cast<gc_clock::duration>(value_cast<db_clock::time_point>(v).time_since_epoch())), now);
break;
default:
// Should never happen - we verified the column's type
// before starting the scan.
[[unlikely]]
on_internal_error(tlogger, format("expiration scanner value of unsupported type {} in column {}", meta[*expiration_column]->type->cql3_type_name(), scan_ctx.column_name) );
}
}
if (expired) {
expiration_stats.items_deleted++;
@@ -708,16 +732,12 @@ static future<bool> scan_table(
co_return false;
}
// attribute_name may be one of the schema's columns (in Alternator, this
// means it's a key column), or an element in Alternator's attrs map
// encoded in Alternator's JSON encoding.
// FIXME: To make this less Alternators-specific, we should encode in the
// single key's value three things:
// 1. The name of a column
// 2. Optionally if column is a map, a member in the map
// 3. The deserializer for the value: CQL or Alternator (JSON).
// The deserializer can be guessed: If the given column or map item is
// numeric, it can be used directly. If it is a "bytes" type, it needs to
// be deserialized using Alternator's deserializer.
// means a key column, in CQL it's a regular column), or an element in
// Alternator's attrs map encoded in Alternator's JSON encoding (which we
// decode). If attribute_name is a real column, in Alternator it will have
// the type decimal, counting seconds since the UNIX epoch, while in CQL
// it will one of the types bigint or int (counting seconds) or timestamp
// (counting milliseconds).
bytes column_name = to_bytes(*attribute_name);
const column_definition *cd = s->get_column_definition(column_name);
std::optional<std::string> member;
@@ -736,11 +756,14 @@ static future<bool> scan_table(
data_type column_type = cd->type;
// Verify that the column has the right type: If "member" exists
// the column must be a map, and if it doesn't, the column must
// (currently) be a decimal_type. If the column has the wrong type
// nothing can get expired in this table, and it's pointless to
// scan it.
// be decimal_type (Alternator), bigint, int or timestamp (CQL).
// If the column has the wrong type nothing can get expired in
// this table, and it's pointless to scan it.
if ((member && column_type->get_kind() != abstract_type::kind::map) ||
(!member && column_type->get_kind() != abstract_type::kind::decimal)) {
(!member && column_type->get_kind() != abstract_type::kind::decimal &&
column_type->get_kind() != abstract_type::kind::long_kind &&
column_type->get_kind() != abstract_type::kind::int32 &&
column_type->get_kind() != abstract_type::kind::timestamp)) {
tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());
co_return false;
}
@@ -767,7 +790,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet, erm->get_topology()); // throws if no secondary replica
if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {
if (!gossiper.is_alive(tablet_primary_replica.host)) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);
@@ -878,12 +901,10 @@ future<> expiration_service::run() {
future<> expiration_service::start() {
// Called by main() on each shard to start the expiration-service
// thread. Just runs run() in the background and allows stop().
if (_db.features().alternator_ttl) {
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
if (!shutting_down()) {
_end = run().handle_exception([] (std::exception_ptr ep) {
tlogger.error("expiration_service failed: {}", ep);
});
}
return make_ready_future<>();
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -30,7 +30,7 @@ namespace alternator {
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
public:
// Object holding per-shard statistics related to the expiration service.
@@ -52,7 +52,7 @@ private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the the background service
// _end is set by start(), and resolves when the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
std::optional<future<>> _end;

26
alternator/ttl_tag.hh Normal file
View File

@@ -0,0 +1,26 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "seastarx.hh"
#include <seastar/core/sstring.hh>
namespace alternator {
// We use the table tag TTL_TAG_KEY ("system:ttl_attribute") to remember
// which attribute was chosen as the expiration-time attribute for
// Alternator's TTL and CQL's per-row TTL features.
// Currently, the *value* of this tag is simply the name of the attribute:
// It can refer to a real column or if that doesn't exist, to a member of
// the ":attrs" map column (which Alternator uses).
extern const sstring TTL_TAG_KEY;
} // namespace alternator
// let users use TTL_TAG_KEY without the "alternator::" prefix,
// to make it easier to move it to a different namespace later.
using alternator::TTL_TAG_KEY;

View File

@@ -31,6 +31,7 @@ set(swagger_files
api-doc/column_family.json
api-doc/commitlog.json
api-doc/compaction_manager.json
api-doc/client_routes.json
api-doc/config.json
api-doc/cql_server_test.json
api-doc/endpoint_snitch_info.json
@@ -68,6 +69,7 @@ target_sources(api
PRIVATE
api.cc
cache_service.cc
client_routes.cc
collectd.cc
column_family.cc
commitlog.cc

View File

@@ -12,7 +12,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset cache",
"summary":"Resets authorized prepared statements cache",
"type":"void",
"nickname":"authorization_cache_reset",
"produces":[

View File

@@ -0,0 +1,23 @@
, "client_routes_entry": {
"id": "client_routes_entry",
"summary": "An entry storing client routes",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"},
"address": {"type": "string"},
"port": {"type": "integer"},
"tls_port": {"type": "integer"},
"alternator_port": {"type": "integer"},
"alternator_https_port": {"type": "integer"}
},
"required": ["connection_id", "host_id", "address"]
}
, "client_routes_key": {
"id": "client_routes_key",
"summary": "A key of client_routes_entry",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"}
}
}

View File

@@ -0,0 +1,74 @@
, "/v2/client-routes":{
"get": {
"description":"List all client route entries",
"operationId":"get_client_routes",
"tags":["client_routes"],
"produces":[
"application/json"
],
"parameters":[],
"responses":{
"200":{
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
},
"default":{
"description":"unexpected error",
"schema":{"$ref":"#/definitions/ErrorModel"}
}
}
},
"post": {
"description":"Upsert one or more client route entries",
"operationId":"set_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
}
],
"responses":{
"200":{ "description": "OK" },
"default":{
"description":"unexpected error",
"schema":{ "$ref":"#/definitions/ErrorModel" }
}
}
},
"delete": {
"description":"Delete one or more client route entries",
"operationId":"delete_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_key" }
}
}
],
"responses":{
"200":{
"description": "OK"
},
"default":{
"description":"unexpected error",
"schema":{
"$ref":"#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -243,7 +243,7 @@
"GOSSIP_DIGEST_SYN",
"GOSSIP_DIGEST_ACK2",
"GOSSIP_SHUTDOWN",
"DEFINITIONS_UPDATE",
"UNUSED__DEFINITIONS_UPDATE",
"TRUNCATE",
"UNUSED__REPLICATION_FINISHED",
"MIGRATION_REQUEST",

View File

@@ -743,7 +743,7 @@
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"description":"The snapshot tag to delete. If omitted, all snapshots are removed.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -751,7 +751,7 @@
},
{
"name":"kn",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"description":"Comma-separated list of keyspace names to delete snapshots from. If omitted, snapshots are deleted from all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -759,7 +759,7 @@
},
{
"name":"cf",
"description":"an optional table name that its snapshot will be deleted",
"description":"A table name used to filter which table's snapshots are deleted. If omitted or empty, snapshots for all tables are eligible. When provided together with 'kn', the table is looked up in each listed keyspace independently. For secondary indexes, the logical index name (e.g. 'myindex') can be used and is resolved automatically.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1295,6 +1295,45 @@
}
]
},
{
"path":"/storage_service/logstor_compaction",
"operations":[
{
"method":"POST",
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3051,7 +3090,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -3085,6 +3124,125 @@
}
]
},
{
"path":"/storage_service/tablets/snapshots",
"operations":[
{
"method":"POST",
"summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",
"type":"void",
"nickname":"take_cluster_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}",
"operations":[{
"method":"POST",
"summary":"Start vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"create_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
},
{
"method":"GET",
"summary":"Get a keyspace's vnodes-to-tablets migration status",
"type":"vnode_tablet_migration_status",
"nickname":"get_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/node/storage_mode",
"operations":[{
"method":"PUT",
"summary":"Set the intended storage mode for this node during vnodes-to-tablets migration",
"type":"void",
"nickname":"set_vnode_tablet_migration_node_storage_mode",
"produces":["application/json"],
"parameters":[
{
"name":"intended_mode",
"description":"Intended storage mode (tablets or vnodes)",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}]
},
{
"path":"/storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization",
"operations":[{
"method":"POST",
"summary":"Finalize vnodes-to-tablets migration for all tables in a keyspace",
"type":"void",
"nickname":"finalize_vnode_tablet_migration",
"produces":["application/json"],
"parameters":[
{
"name":"keyspace",
"description":"Keyspace name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[
@@ -3187,6 +3345,38 @@
}
]
},
{
"path":"/storage_service/logstor_info",
"operations":[
{
"method":"GET",
"summary":"Logstor segment information for one table",
"type":"table_logstor_info",
"nickname":"logstor_info",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"table name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
@@ -3595,6 +3785,47 @@
}
}
},
"logstor_hist_bucket":{
"id":"logstor_hist_bucket",
"properties":{
"bucket":{
"type":"long"
},
"count":{
"type":"long"
},
"min_data_size":{
"type":"long"
},
"max_data_size":{
"type":"long"
}
}
},
"table_logstor_info":{
"id":"table_logstor_info",
"description":"Per-table logstor segment distribution",
"properties":{
"keyspace":{
"type":"string"
},
"table":{
"type":"string"
},
"compaction_groups":{
"type":"long"
},
"segments":{
"type":"long"
},
"data_size_histogram":{
"type":"array",
"items":{
"$ref":"logstor_hist_bucket"
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",
@@ -3629,6 +3860,45 @@
"description":"The resulting compression ratio (estimated on a random sample of files)"
}
}
},
"vnode_tablet_migration_node_status":{
"id":"vnode_tablet_migration_node_status",
"description":"Node storage mode info during vnodes-to-tablets migration",
"properties":{
"host_id":{
"type":"string",
"description":"The host ID"
},
"current_mode":{
"type":"string",
"description":"The current storage mode: `vnodes` or `tablets`"
},
"intended_mode":{
"type":"string",
"description":"The intended storage mode: `vnodes` or `tablets`"
}
}
},
"vnode_tablet_migration_status":{
"id":"vnode_tablet_migration_status",
"description":"Vnodes-to-tablets migration status for a keyspace",
"properties":{
"keyspace":{
"type":"string",
"description":"The keyspace name"
},
"status":{
"type":"string",
"description":"The migration status: `vnodes` (not started), `migrating_to_tablets` (in progress), or `tablets` (complete)"
},
"nodes":{
"type":"array",
"items":{
"$ref":"vnode_tablet_migration_node_status"
},
"description":"Per-node storage mode information. Empty if the keyspace is not being migrated."
}
}
}
}
}

View File

@@ -209,6 +209,21 @@
"parameters":[]
}
]
},
{
"path":"/system/chosen_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get sstable version currently chosen for use in new sstables",
"type":"string",
"nickname":"get_chosen_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "api.hh"
@@ -37,6 +37,7 @@
#include "raft.hh"
#include "gms/gossip_address_map.hh"
#include "service_levels.hh"
#include "client_routes.hh"
logging::logger apilog("api");
@@ -67,9 +68,11 @@ future<> set_server_init(http_context& ctx) {
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
rb02->register_api_file(r, "metrics");
rb02->register_api_file(r, "client_routes");
rb->register_function(r, "system",
"The system related API");
rb02->add_definitions_file(r, "metrics");
rb02->add_definitions_file(r, "client_routes");
set_system(ctx, r);
rb->register_function(r, "error_injection",
"The error injection API");
@@ -119,9 +122,9 @@ future<> unset_thrift_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_thrift_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, group0_client);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &ssc, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, ssc, group0_client);
});
}
@@ -129,6 +132,16 @@ future<> unset_server_storage_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_storage_service(ctx, r); });
}
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr) {
return ctx.http_server.set_routes([&ctx, &cr] (routes& r) {
set_client_routes(ctx, r, cr);
});
}
future<> unset_server_client_routes(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_client_routes(ctx, r); });
}
future<> set_load_meter(http_context& ctx, service::load_meter& lm) {
return ctx.http_server.set_routes([&ctx, &lm] (routes& r) { set_load_meter(ctx, r, lm); });
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -23,31 +23,6 @@
namespace api {
template<class T>
std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {
std::vector<T> res;
res.reserve(map.size());
for (const auto& [key, value] : map) {
res.push_back(T());
res.back().key = key;
res.back().value = value;
}
return res;
}
template<class T, class MAP>
std::vector<T>& map_to_key_value(const MAP& map, std::vector<T>& res) {
res.reserve(res.size() + std::size(map));
for (const auto& [key, value] : map) {
T val;
val.key = fmt::to_string(key);
val.value = fmt::to_string(value);
res.push_back(val);
}
return res;
}
template <typename T, typename S = T>
T map_sum(T&& dest, const S& src) {
for (const auto& i : src) {

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
@@ -29,6 +29,7 @@ class storage_proxy;
class storage_service;
class raft_group0_client;
class raft_group_registry;
class client_routes_service;
} // namespace service
@@ -97,8 +98,10 @@ future<> set_server_config(http_context& ctx, db::config& cfg);
future<> unset_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);
future<> unset_server_client_routes(http_context& ctx);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "api/api-doc/authorization_cache.json.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "cache_service.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

176
api/client_routes.cc Normal file
View File

@@ -0,0 +1,176 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include <seastar/http/short_streams.hh>
#include "client_routes.hh"
#include "api/api.hh"
#include "service/storage_service.hh"
#include "service/client_routes.hh"
#include "utils/rjson.hh"
#include "api/api-doc/client_routes.json.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
using namespace json;
extern logging::logger apilog;
namespace api {
static void validate_client_routes_endpoint(sharded<service::client_routes_service>& cr, sstring endpoint_name) {
if (!cr.local().get_feature_service().client_routes) {
apilog.warn("{}: called before the cluster feature was enabled", endpoint_name);
throw std::runtime_error(fmt::format("{} requires all nodes to support the CLIENT_ROUTES cluster feature", endpoint_name));
}
}
static sstring parse_string(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
throw bad_param_exception(fmt::format("Missing '{}'", name));
}
if (!it->value.IsString()) {
throw bad_param_exception(fmt::format("'{}' must be a string", name));
}
return {it->value.GetString(), it->value.GetStringLength()};
}
static std::optional<uint32_t> parse_port(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
return std::nullopt;
}
if (!it->value.IsInt()) {
throw bad_param_exception(fmt::format("'{}' must be an integer", name));
}
auto port = it->value.GetInt();
if (port < 1 || port > 65535) {
throw bad_param_exception(fmt::format("'{}' value={} is outside the allowed port range", name, port));
}
return port;
}
static std::vector<service::client_routes_service::client_route_entry> parse_set_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_entry> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
if (!element.IsObject()) { throw bad_param_exception("Each element must be object"); }
const auto port = parse_port("port", element);
const auto tls_port = parse_port("tls_port", element);
const auto alternator_port = parse_port("alternator_port", element);
const auto alternator_https_port = parse_port("alternator_https_port", element);
if (!port.has_value() && !tls_port.has_value() && !alternator_port.has_value() && !alternator_https_port.has_value()) {
throw bad_param_exception("At least one port field ('port', 'tls_port', 'alternator_port', 'alternator_https_port') must be specified");
}
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)},
parse_string("address", element),
port,
tls_port,
alternator_port,
alternator_https_port
);
}
return v;
}
static
future<json::json_return_type>
rest_set_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "rest_set_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().set_client_routes(parse_set_client_array(root));
co_return seastar::json::json_void();
}
static std::vector<service::client_routes_service::client_route_key> parse_delete_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_key> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)}
);
}
return v;
}
static
future<json::json_return_type>
rest_delete_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "delete_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().delete_client_routes(parse_delete_client_array(root));
co_return seastar::json::json_void();
}
static
future<json::json_return_type>
rest_get_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "get_client_routes");
co_return co_await cr.invoke_on(0, [] (service::client_routes_service& cr) -> future<json::json_return_type> {
co_return json::json_return_type(stream_range_as_array(co_await cr.get_client_routes(), [](const service::client_routes_service::client_route_entry & entry) {
seastar::httpd::client_routes_json::client_routes_entry obj;
obj.connection_id = entry.connection_id;
obj.host_id = fmt::to_string(entry.host_id);
obj.address = entry.address;
if (entry.port.has_value()) { obj.port = entry.port.value(); }
if (entry.tls_port.has_value()) { obj.tls_port = entry.tls_port.value(); }
if (entry.alternator_port.has_value()) { obj.alternator_port = entry.alternator_port.value(); }
if (entry.alternator_https_port.has_value()) { obj.alternator_https_port = entry.alternator_https_port.value(); }
return obj;
}));
});
}
void set_client_routes(http_context& ctx, routes& r, sharded<service::client_routes_service>& cr) {
seastar::httpd::client_routes_json::set_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_set_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::delete_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_delete_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::get_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_get_client_routes(ctx, cr, std::move(req));
});
}
void unset_client_routes(http_context& ctx, routes& r) {
seastar::httpd::client_routes_json::set_client_routes.unset(r);
seastar::httpd::client_routes_json::delete_client_routes.unset(r);
seastar::httpd::client_routes_json::get_client_routes.unset(r);
}
}

20
api/client_routes.hh Normal file
View File

@@ -0,0 +1,20 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <seastar/core/sharded.hh>
#include <seastar/json/json_elements.hh>
#include "api/api_init.hh"
namespace api {
void set_client_routes(http_context& ctx, httpd::routes& r, sharded<service::client_routes_service>& cr);
void unset_client_routes(http_context& ctx, httpd::routes& r);
}

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "collectd.hh"

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include <fmt/ranges.h>
@@ -18,7 +18,9 @@
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include <sstream>
#include "db/data_listeners.hh"
#include "utils/hash.hh"
#include "storage_service.hh"
#include "compaction/compaction_manager.hh"
#include "unimplemented.hh"
@@ -342,6 +344,56 @@ uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<
return ret;
}
static
future<json::json_return_type>
rest_toppartitions_generic(sharded<replica::database>& db, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db) {
cf::get_column_family_name.set(r, [&db] (const_req req){
std::vector<sstring> res;
@@ -1047,6 +1099,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
});
});
ss::toppartitions_generic.set(r, [&db] (std::unique_ptr<http::request> req) {
return rest_toppartitions_generic(db, std::move(req));
});
cf::force_major_compaction.set(r, [&ctx, &db](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
if (!req->get_query_param("split_output").empty()) {
fail(unimplemented::cause::API);
@@ -1213,6 +1269,7 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_sstable_count_per_level.unset(r);
cf::get_sstables_for_key.unset(r);
cf::toppartitions.unset(r);
ss::toppartitions_generic.unset(r);
cf::force_major_compaction.unset(r);
ss::get_load.unset(r);
ss::get_metrics_load.unset(r);

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once

View File

@@ -3,7 +3,7 @@
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "commitlog.hh"

Some files were not shown because too many files have changed in this diff Show More